Preface



EsTAL - España for Natural Language Processing - continued on from the
three previous conferences: FracTAL, held at the Université de Franche-Comté,
Besançon (France), in December 1997; VexTAL, held at Venice International
University, Ca' Foscari (Italy), in November 1999; and PorTAL, held at the
Universidade do Algarve, Faro (Portugal), in June 2002. The main goals of these
conferences have been: (i) to bring together the international NLP community;
(ii) to strengthen the position of local NLP research in the international NLP
community; and (iii) to provide a forum for discussion of new research and
applications.

EsTAL contributed to achieving these goals and increasing the already high
international standing of these conferences, largely due to its Program
Committee, composed of renowned researchers in the field of natural language
processing and its applications. This clearly contributed to the significant
number of papers submitted (72) by researchers from 18 different countries.

The scope of the conference was structured around the following main topics:
(i) (spoken and written language analysis and generation; pragmatics,
discourse, semantics, syntax and morphology; lexical resources; word sense
disambiguation; linguistic, mathematical, and psychological models of
language; knowledge acquisition and representation; corpus-based and
statistical language modelling; machine translation and translation aids;
computational lexicography), and (ii) (information retrieval, extraction and
question answering; automatic summarization; document categorization; natural
language interfaces; dialogue systems and evaluation of systems).

Each paper was reviewed by three reviewers from the Program Committee or
by external referees designated by them. All those who contributed are
mentioned on the following pages. The review process led to the selection of
42 papers for presentation. They have been published in this volume.

We would like to express here our thanks to all the reviewers for their quick
and excellent work. We extend these thanks to our invited speakers, Walter
Daelemans and Rada Mihalcea, for their valuable contribution, which
undoubtedly increased the interest in the conference. We are also indebted to
a number of individuals for taking care of specific parts of the conference
program, especially to Miguel Ángel Varo, who built and maintained all Web
services for the conference.

Finally, we want to thank the University of Alicante and, especially, the
Office of the Vice-President for Extracurricular Activities and the Department
of Software and Computing Systems for their support of this conference.
Adaptive Selection of Base Classifiers in
One-Against-All Learning for Large
Multi-labeled Collections



Arturo Montejo Ráez¹, Luis Alfonso Ureña López², and Ralf Steinberger³



¹ European Laboratory for Nuclear Research, Geneva, Switzerland
² Department of Computer Science, University of Jaén, Spain
³ European Commission, Joint Research Centre, Ispra, Italy



Abstract. In this paper we present the problem found when studying 
an automated text categorization system for a collection of High En- 
ergy Physics (HEP) papers, which shows a very large number of possible 
classes (over 1,000) with highly imbalanced distribution. The collection 
is introduced to the scientific community and its imbalance is studied 
applying a new indicator: the inner imbalance degree. The one-against-all
approach is used to perform multi-label assignment using Support
Vector Machines. Over-weighting of positive samples and S-Cut
thresholding are compared to an approach that automatically selects a
classifier for each class from a set of candidates. We also found that it is
possible to
reduce computational cost of the classification task by discarding classes 
for which classifiers cannot be trained successfully. 



1 Introduction 

The automatic assignment of keywords to documents using full-text data is a
subtask of text categorization, a growing area where Information Retrieval
techniques and Machine Learning algorithms meet, offering solutions to
problems with real-world collections.

We can distinguish three paradigms in text categorization: the binary case,
the multi-class case and the multi-label case. In the binary case a sample
either belongs or does not belong to one of two given classes. In the
multi-class case a sample belongs to just one of a set of classes. Finally, in
the multi-label case, a sample may belong to several classes at the same time,
that is, classes are not mutually exclusive. In binary classification a
classifier is trained, by means of supervised algorithms, to assign a sample
document to one of two possible sets. These sets are usually referred to as
belonging (positive) or not belonging (negative) samples respectively (the
one-against-all approach), or to two disjoint classes (the one-against-one
approach). For these two binary classification tasks we can select among a
wide range of algorithms, including Naive Bayes, Linear Regression, Support
Vector Machines (SVM) [8] and LVQ [11]. SVM has been reported to outperform
the other algorithms. The binary case has been set as a base case from which
the two other cases are derived. In multi-class and multi-label assignment,
the traditional approach consists of training a binary classifier for each
class, and then, whenever the binary base case returns a measure of confidence
on the classification, assigning either the top ranked one (multi-class
assignment) or a given number of the top ranked ones (multi-label assignment).
More details about these three paradigms can be found in [1]. We will refer to
the ranking approach as the battery strategy because inter-dependency is not
taken into consideration.

Another approach for multi-labeling consists of returning all those classes 
whose binary classifiers provide a positive answer for the sample. It has the 
advantage of allowing different binary classifiers for each class, since inter-class 
scores do not need to be coherent (since there is no ranking afterwards). Better 
results have been reported when applying one-against-one in multi-class classi- 
fication [1], but in our multi-label case this is not an option because any class 
could theoretically appear together with any other class, making it difficult to 
establish disjoint assignments. This is the reason why one-against-all deserves 
our attention in the present work. 

Although classification is subject to intense research (see [18]), some issues
demand more attention than they have been given so far. In particular, problems
relating to multi-label classification would require more attention. However,
due to the lack of available resources (mainly multi-labeled document
collections), this area advances more slowly than others. Furthermore,
multi-label assignment should not simply be studied as a general multi-class
problem (which itself is rather different from the binary case), but needs to
be considered as a special case with additional requirements. For instance, in
multi-label cases, some classes are inter-related, the degree of imbalance is
usually radically different from one class to the next and, from a performance
point of view, comparing a sample against every single classifier is a waste
of resources.



2 The Class Imbalance Problem 

Usually, multi-labeled collections make use of a wide variety of classes,
resulting in an unequal distribution of classes throughout the collection and a
high number of rare classes. This means not only that there is a strong
imbalance between positive and negative samples, but also that some classes
are used much more frequently than other classes. This phenomenon, known as
the class imbalance problem, is especially relevant for algorithms like the
C4.5 classification tree [4, 3] and margin-based classifiers like SVM
[16, 20, 7].

Extensive studies have been carried out on this subject as reported by
Japkowicz [7], identifying three major issues in the class imbalance problem:
concept complexity, size of the training set, and degree of imbalance. Concept
complexity refers to the degree of “sparsity” of a certain class in the
feature space (the space where document vectors are represented). This means
that a hypothetical clustering algorithm acting on a class with high concept
complexity would establish many small clusters for the same class. Regarding
the second issue, i.e. the lack of a sufficiently large training set, the only
possible remedy is the usage of over-sampling when the amount of available
samples is insufficient, and of under-sampling techniques for classes with too
many samples, e.g. just using a limited number of samples for training an SVM,
by selecting those positive and negative samples that are close to each other
in the feature space. The validity of these techniques is also subject to
debate [4]. Finally, Japkowicz defines the degree of imbalance as an index to
indicate how much more one class is represented than another, including both
the degree of imbalance between classes and the degree of imbalance between
the positive and negative samples of a class (what we call the inner imbalance
degree). Unfortunately, Japkowicz defined these values for her work towards
the generation of an artificial collection and rewrote them later to fit
specific problems regarding fixed parameters and the C5.0 algorithm, which
makes them difficult to manipulate. For these reasons, we cannot reuse her
equations and propose here a variant focusing on the multi-label case.

We define the inner imbalance degree of a certain class i as a measure of the
positive samples over the total of samples:

$$ i_i = \left| 1 - \frac{2 n_i}{n} \right| \qquad (1) $$

where

n is the total number of samples and

n_i is the total number of samples having the class i in their labels.

Japkowicz’ definition of imbalance degree helps in the generation of
artificial distributions of documents to classes. Its value does not lie
within a defined range, which makes it difficult to manipulate and compare
with the degree of other classes in different partitions. The value proposed
in equation 1 is zero for perfectly balanced classes, i.e. when the number of
positive samples equals the number of negative samples. It has a value of 1
when all samples are either positive or negative for that class. Its linear
behavior is shown in figure 1 and, as we can see, it varies within the range
[0,1].
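The behavior of equation 1 can be checked with a few lines of Python. This is only a minimal sketch: the function name `inner_imbalance_degree` and the sample counts in the loop are illustrative assumptions, not part of the original work.

```python
def inner_imbalance_degree(n_pos: int, n_total: int) -> float:
    """Equation 1: |1 - 2 * n_i / n| for a class with n_i positive samples."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if not 0 <= n_pos <= n_total:
        raise ValueError("n_pos must lie between 0 and n_total")
    return abs(1.0 - 2.0 * n_pos / n_total)


if __name__ == "__main__":
    # Illustrative counts for a hypothetical collection of 1000 samples.
    for n_pos in (0, 250, 500, 750, 1000):
        print(n_pos, inner_imbalance_degree(n_pos, 1000))
```

The value depends only on the fraction of positive samples, which is why it stays comparable between partitions of different sizes.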



Fig. 1. The linear ’imbalance degree’ function



3 The HEP Collection

A very suitable document set for multi-label categorization research is the HEP
collection of preprints, available from the European Laboratory for Nuclear
Research. Some experiments have been carried out using this collection
([13, 12]), and its interesting distribution of classes allows us to carry out
a number of experiments and to design a new approach. An analysis of the
collection has shown that there is the typical high level of imbalance among
classes. If a given class is rarely represented in a collection, we can
intuitively foresee a biased training that will yield classifiers with a low
performance. It is clear that, if the collection were perfectly balanced, we
could expect better categorization results, due to better learning.

The hep-ex partition of the HEP collection is composed of 2802 abstracts
related to experimental high-energy physics that are indexed with 1093 main
keywords (the categories).¹ Figure 2 shows the distribution of keywords across
the collection.




[Figure 2: two plots over the keywords, (a) All classes and (b) 100 most
frequent]

Fig. 2. Distribution of classes across documents in the hep-ex partition



As we can see, this partition is very imbalanced: only 84 classes are
represented by more than 100 samples and only five classes by more than 1000.
The uneven use is particularly noticeable for the ten most frequent keywords:
table 1 lists, for each of them, the number of positive samples and, in
parentheses, the percentage over the total of samples in the collection.

Table 1. The ten most frequent main keywords in the hep-ex partition

  No. docs.     Keyword
  1898 (67%)    electron positron
  1739 (62%)    experimental results
  1478 (52%)    magnetic detector
  1190 (42%)    quark
  1113 (39%)    talk
   715 (25%)    Z0
   676 (24%)    anti-p p
   551 (19%)    neutrino
   463 (16%)    W
   458 (16%)    jet

We can now study this collection applying the inner imbalance degree measure
defined in equation 1. The two graphs in figures 3a and 3b show the inner
imbalance degree for the main keywords in the hep-ex partition. We can notice
how fast the imbalance grows to a total imbalance degree of almost 1. When
looking at the ten most frequent classes, we can see the effect of our degree
estimation: classes 0 and 1 are more imbalanced than class 2, which gets the
lowest degree of imbalance in the whole set of classes. This is due to the
fact that, as shown by table 1, this class has almost the same number of
positive and negative samples. From class 3 onwards, the imbalance then grows
dramatically.

¹ We did not consider the keywords related to reaction and energy because they
are based on formulae and other specific data that are not easily identifiable
in the plain-text version of a paper.





(a) All classes (b) Ten most frequent 

Fig. 3. Imbalance degree of classes in the hep-ex partition 



When training binary classifiers for these keywords, we realized that the
performance decreases strongly with growing imbalance degree. To correct the
document distribution across classes, we can use over-sampling (or
under-sampling) or tune our classifiers accordingly. For example, for SVM we
can set a cost factor by which training errors on positive samples outweigh
errors on negative samples [14]. We will use this in our experiments.



4 Balance Weighting and Classifier Filtering 

Some algorithms work better when, in the one-against-all approach, the number
of positive samples is similar to the number of negative ones, i.e. when the
class is balanced across the collection. However, multi-label collections are
typically highly imbalanced. This is true for the HEP collection, but also for
other known document sets like the OHSUMED medical collection used in the
filtering track of TREC [5], and for the document collection of the European
Institutions classified according to the EUROVOC thesaurus. This latter
collection has been studied extensively for automatic indexing by Pouliquen et
al. (e.g. [2]), who exploit a variety of parameters in their attempt to
determine whether some terms refer to one class or to another in the
multi-labeled set.

The question now is how to deal with these collections when trying to apply
binary learners that are sensitive to high imbalance degrees. We can use
techniques like over-sampling and under-sampling, as pointed out earlier, but
this would lead to an overload of non-informative samples in the former case,
and to a loss of information in the latter. Furthermore, concept complexity
also has its effects on binary classifiers. We have not paid attention to this
fact since it is out of the scope of the present study, but we should consider
it to be yet another drawback for collections indexed with a large number of
non-balanced classes.

In our experiments we basically train a system using the battery strategy, but
(i) we allow tuning the binary classifier for a given class by a balance
factor, and (ii) we provide the possibility of choosing the best of a given
set of binary classifiers. At CERN, we intend to apply our classification
system to environments in which a gain in classification speed is very
important. Therefore, we have introduced a parameter α in the algorithm,
resulting in the updated version given in figure 4. This value is a threshold
for the minimum performance of a binary classifier during the validation phase
in the learning process. If the performance of a certain classifier is below
the value α, meaning that the classifier performs badly, we discard the
classifier and the class completely. By doing this, we may decrease the recall
slightly (since fewer classes get trained and assigned), but the advantages of
increased computational performance and of higher precision compensate for it.
The effect is similar to that of the approach proposed by Yang [21]. We never
attempt to return a positive answer for rare classes. In the following, we
show how this filtering saves us from considering many classes without
significant loss in performance.

We allow over-weighted positive samples using the actual ratio of negative
samples to positive ones, that is, the weight for positive samples (w+) is:

$$ w_+ = \frac{C_-}{C_+} \qquad (2) $$

where

C_- is the total number of negative samples for the class
C_+ is the total number of positive samples for the class

As we can see, the more positive documents we have for a given class, the
lower the over-weight is, which makes sense in order to give more weight only
when few positive samples are found. This method was used by Morik et al.
[14] but they did not report how much it improved the performance of the
classifier over the non-weighted scheme. As we said, this factor was used
in our experiments to over-weight positive samples over negative ones, i.e. the
cost of a classification error on a positive sample is higher than that of an
error on a negative one.
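As a worked illustration of equation 2, the short sketch below computes the weight for three keywords of table 1, taking 2802 as the collection size. The function name `positive_weight` is an assumption made for this example.

```python
def positive_weight(n_pos: int, n_total: int) -> float:
    """Equation 2: w+ = C_- / C_+, the ratio of negative to positive samples."""
    if n_pos <= 0:
        raise ValueError("a class needs at least one positive sample")
    if n_pos > n_total:
        raise ValueError("n_pos cannot exceed n_total")
    return (n_total - n_pos) / n_pos


if __name__ == "__main__":
    n_total = 2802  # abstracts in the hep-ex partition
    # Positive counts taken from table 1.
    for keyword, n_pos in (("electron positron", 1898),
                           ("magnetic detector", 1478),
                           ("jet", 458)):
        print(f"{keyword}: w+ = {positive_weight(n_pos, n_total):.2f}")
```

Classes with more positive than negative samples get a weight below 1, so under this rule the most frequent keywords are effectively under-weighted.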

We also considered the S-Cut approach. The assignment of a sample as positive
can be tuned by specifying the decision border. By default it is zero, but
it can be set using the S-Cut algorithm [21]. This algorithm uses as threshold
the one that gives the best performance on an evaluation set. That is, once the
classifier has been trained, we apply it to an evaluation set, using the
classification values (the margin for SVM) as possible thresholds. The
threshold that yields the best performance (the highest F1 in our case) is
used.
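The threshold search just described can be sketched as follows. This is a simplified reading of S-Cut, not the reference implementation from [21]: the function names, the `score >= threshold` decision rule and the toy scores are assumptions made for illustration.

```python
from typing import List, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 expressed with true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def scut_threshold(scores: List[float], labels: List[bool]) -> Tuple[float, float]:
    """Try every observed score as threshold; keep the one with the best F1."""
    if not scores or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and aligned")
    best_threshold, best_f1 = 0.0, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        f1 = f1_from_counts(tp, fp, fn)
        if f1 > best_f1:  # ties keep the lowest threshold
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


if __name__ == "__main__":
    # Hypothetical SVM margins on an evaluation set and their true labels.
    margins = [-1.2, -0.4, -0.1, 0.3, 0.9]
    truth = [False, True, True, False, True]
    print(scut_threshold(margins, truth))
```

In this toy example two positive samples have negative margins, so the default border of zero would miss them, while the selected threshold recovers both at the price of one false positive.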
Input:

  a set of multi-labeled training documents D_t
  a set of validation documents D_v
  a threshold α on the evaluation measure
  a set of possible labels (classes) L
  a set of candidate binary classifiers C

Output:

  a set C' = {c_1, ..., c_k, ..., c_|L|} of trained binary classifiers

Pseudo code:

  C' = ∅
  for-each l_i in L do
    T = ∅
    for-each c_j in C do
      train-classifier(c_j, l_i, D_t)
      T = T ∪ {c_j}
    end-for-each
    c_best = best-classifier(T, D_v)
    if evaluate-classifier(c_best) > α
      C' = C' ∪ {c_best}
    end-if
  end-for-each



Fig. 4. The one-against-all learning algorithm with classifier filtering
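The algorithm of figure 4 can be sketched in Python as shown below. This is a minimal illustration under stated assumptions: all names (`learn_with_filtering`, `f1_on`, `threshold_trainer`) are hypothetical, F1 is assumed as the evaluation measure, and the toy one-feature threshold learners merely stand in for differently configured SVMs.

```python
from typing import Callable, Dict, List, Sequence, Set, Tuple

Vector = Sequence[float]
Sample = Tuple[Vector, bool]           # (features, is positive for the class)
Predictor = Callable[[Vector], bool]
Trainer = Callable[[List[Sample]], Predictor]
Document = Tuple[Vector, Set[str]]     # (features, assigned labels)


def f1_on(predictor: Predictor, samples: List[Sample]) -> float:
    """F1 of a binary predictor on labeled samples (0.0 if undefined)."""
    tp = sum(1 for x, y in samples if y and predictor(x))
    fp = sum(1 for x, y in samples if not y and predictor(x))
    fn = sum(1 for x, y in samples if y and not predictor(x))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def learn_with_filtering(labels: List[str], trainers: List[Trainer],
                         train_docs: List[Document], valid_docs: List[Document],
                         alpha: float) -> Dict[str, Predictor]:
    """Keep, per label, the best candidate if its validation F1 exceeds alpha."""
    selected: Dict[str, Predictor] = {}
    for label in labels:
        train = [(x, label in ys) for x, ys in train_docs]
        valid = [(x, label in ys) for x, ys in valid_docs]
        candidates = [trainer(train) for trainer in trainers]
        best = max(candidates, key=lambda p: f1_on(p, valid))
        if f1_on(best, valid) > alpha:
            selected[label] = best     # otherwise the class is discarded
    return selected


def threshold_trainer(margin: float) -> Trainer:
    """Toy learner (assumption): accept x[0] >= smallest positive - margin."""
    def train(samples: List[Sample]) -> Predictor:
        positives = [x[0] for x, y in samples if y]
        cut = min(positives) - margin if positives else float("inf")
        return lambda x: x[0] >= cut
    return train


TRAIN = [((0.9,), {"quark"}), ((0.8,), {"quark"}), ((0.2,), {"talk"}), ((0.1,), set())]
VALID = [((0.85,), {"quark"}), ((0.15,), {"talk"}), ((0.5,), set())]
TRAINERS = [threshold_trainer(0.0), threshold_trainer(0.5)]

if __name__ == "__main__":
    kept = learn_with_filtering(["quark", "talk", "jet"], TRAINERS, TRAIN, VALID, alpha=0.6)
    print(sorted(kept))
```

Raising `alpha` shrinks the returned dictionary, which is exactly what reduces the number of classifiers that have to be applied to each new document.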

5 Experiments and Results 

5.1 Data Preparation 

The collection consists of 2967 full-text abstracts linked to 1103 main keywords. 
Each abstract was processed as follows: 

- Punctuation was removed

- Every character was lowercased
- Stop words were removed

- The Porter stemming algorithm [15] was applied

- Resulting stems were weighted according to the TF.IDF scheme [17]

After processing the collection in this way, we trained the system applying
each strategy using the SVM-Light² package as the base binary classifier. We
also filtered out classes not appearing in any document either in the training,
validation or test sets, reducing the number of classes to 443.8 on average.
Results are shown at the end of this section.

For the evaluation in experiments, cross-validation [9] was used in order to
produce statistically relevant results that do not depend on the partitioning
of the collection into training, validation and test sets. Extensive
experiments have shown that this is the best choice to get an accurate
estimate. The measures computed are precision and recall. The F1 measure
(introduced by van Rijsbergen [19]) is used as an overall indicator based on
the two former ones and is the reference when filtering is applied. Accuracy
and error measurements are also given for later discussion. The final values
are computed using macro-averaging on a per-document basis, rather than the
usual micro-averaging over classes. The reason is, again, the high imbalance
in the collection. If we average by class, rare classes will influence the
result as much as the most frequent ones, which will not provide a good
estimate of the performance of the multi-label classifier over documents.
Since the goal of this system is to be used for automated classification of
individual documents, we consider it to be far more useful to concentrate on
these measurements for our evaluation of the system. More details about these
concepts can be found in [18], [10] and [22].

² SVM-Light is available at http://svmlight.joachims.org/
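One plausible reading of averaging on a per-document basis is sketched below: precision, recall and F1 are computed for each document from its assigned and expected label sets and then averaged over documents. The function name and the convention of scoring empty sets as 0.0 are assumptions of this sketch, not details given in the paper.

```python
from typing import List, Set, Tuple


def per_document_scores(assigned: List[Set[str]],
                        expected: List[Set[str]]) -> Tuple[float, float, float]:
    """Average precision, recall and F1 over documents (not over classes)."""
    if not assigned or len(assigned) != len(expected):
        raise ValueError("need aligned, non-empty lists of label sets")
    precisions, recalls, f1s = [], [], []
    for got, want in zip(assigned, expected):
        tp = len(got & want)
        p = tp / len(got) if got else 0.0    # assumption: empty output scores 0
        r = tp / len(want) if want else 0.0  # assumption: empty truth scores 0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(f1s)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


if __name__ == "__main__":
    # Two hypothetical documents with system output and reference keywords.
    system = [{"quark", "jet"}, {"talk"}]
    reference = [{"quark"}, {"talk", "W"}]
    print(per_document_scores(system, reference))
```

Because F1 is averaged document by document, the reported F1 need not equal the harmonic mean of the reported precision and recall, which is consistent with the figures in the result tables.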

5.2 Results 

Table 2 shows the results of ten runs of our multi-label classifier with
different configurations. The highest values of F1 are reached when letting
the system choose among fixed values for over-weighting positive samples (2,
5, 10 and 20). These are the results when applying the algorithm of figure 4
with α = 0.0, i.e. no filtering over classifiers is done.



Table 2. Results of experiments using SVM

  Experiment               Precision  Recall     F1  Accuracy  Error  % of classes covered
  No weight                    74.07   33.96  43.92     98.23   1.77                 33.96
  No weight / Scut             74.26   34.44  44.38     98.24   1.76                 99.95
  Overweight 20                51.47   45.84  46.50     97.71   2.29                 57.32
  Auto weight                  58.10   44.39  48.09     97.94   2.06                 58.09
  Overw. 2,5,10,20 / Scut      71.74   39.92  48.47     98.25   1.75                100.00
  Auto weight / Scut           58.03   45.30  48.56     97.89   2.11                 99.82
  Overweight 2                 70.74   40.45  48.78     98.21   1.79                 53.36
  Overweight 5                 64.56   43.57  49.40     98.11   1.89                 57.19
  Overweight 10                62.30   45.22  50.14     98.08   1.92                 57.30
  Overw. 2,5,10,20             65.89   44.59  50.53     98.17   1.83                 57.53




We see that the top recall reached does not imply having more classes trained.
Therefore we may want to study how we can reduce the number of classes trained
to speed up the classification process without losing too much in performance.
For that purpose, we experimented with different values of α, as shown in
tables 3 and 4.
Table 3. Results of experiments using multi-weighted SVM with filtering

  α                    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7
  Precision          65.89  70.04  70.41  70.88  71.90  71.96  71.02  67.96
  Recall             44.59  44.49  43.95  42.95  40.54  36.65  31.80  23.02
  F1                 50.53  51.59  51.32  50.77  49.21  46.11  41.70  32.83
  Accuracy           98.17  98.25  98.25  98.25  98.24  98.21  98.15  98.03
  Error               1.83   1.75   1.75   1.75   1.76   1.79   1.85   1.97
  % classes trained  57.53  56.49  50.81  43.20  32.73  23.23  16.00   8.58




Table 4. Results of experiments using auto-weighted S-Cut thresholded SVM with
filtering

  α                    0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7
  Precision          58.03  62.47  64.84  67.45  69.47  71.19  71.14  68.24
  Recall             45.30  45.04  44.83  44.24  42.76  39.59  34.43  24.88
  F1                 48.56  49.93  50.47  50.75  50.27  48.37  44.10  34.76
  Accuracy           97.89  98.06  98.14  98.20  98.23  98.22  98.17  98.05
  Error               2.11   1.94   1.86   1.80   1.77   1.78   1.83   1.95
  % classes trained  99.82  85.30  77.10  68.47  55.74  42.34  30.82  16.72




5.3 Analysis 

Interesting conclusions can be drawn from the tables above. The first thing we
notice is that recall is low compared to precision. This is normal if we
consider the existence of rare and, therefore, difficult-to-train classes.
When tuning our multi-label classifier, we see that variations in precision
are more pronounced than those in recall. The F1 measure remains quite stable:
throughout all the experiments with different configurations, the most we gain
is 6.61%. However, a very important result is that, even when some
configurations are able to train up to 100% of the total of classes involved
(we can see how the percentage of classes successfully trained varies widely),
this does not influence the overall performance of the classifier that much.
This is the reason for the design of our filtering algorithm. Furthermore, it
is not clear that S-Cut and auto-weighting strategies are so relevant for our
data. As we can also notice, accuracy and error are not very sensitive to the
variations of our parameters, but this is again due to imbalance: most of the
classes are rare and for the most frequent ones we get high precision and
recall, even with not very sophisticated configurations.

When discarding classes, we obviously gain in precision and, despite more
classes not being trained, we do not lose that much in recall. The result is a
better F1 than without discarding, as shown by the F1 values in table 2
compared to those of tables 3 and 4. We can see how strongly we can reduce the
number of classes without significantly affecting the overall performance of
the multi-label classifier. Figures 5a and 5b visualize the behavior
described. The bigger our α is, the more classes are discarded. From all the
test runs, the best value of F1 was obtained with an α value of 0.1 and using
candidate classifiers with over-weights 2, 5, 10 and 20 for positive classes.
From the graphs we can see that increasing α yields a higher precision up to a
maximum, beyond which the threshold becomes so restrictive that even good
classifiers are discarded and, therefore, the precision starts to decrease.
Thus, our choice of α will depend on our preference of precision over recall
and our need of reducing classes for faster classification. If we are able to
discard non-relevant (rarely used) classes, we can almost maintain our
performance classifying against a lower number of classes.

Fig. 5. Influence of filtering on (a) multi-weighted SVM and (b) auto-weighted
with S-cut thresholding
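The trade-off can be made concrete with the figures of table 3. The sketch below picks the largest α whose F1 stays within a chosen tolerance of the best F1; this selection rule and the function name are assumptions introduced for illustration, not a procedure proposed in the paper.

```python
from typing import List

# Values copied from table 3 (multi-weighted SVM with filtering).
ALPHAS = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
F1 = [50.53, 51.59, 51.32, 50.77, 49.21, 46.11, 41.70, 32.83]
CLASSES_TRAINED = [57.53, 56.49, 50.81, 43.20, 32.73, 23.23, 16.00, 8.58]


def largest_alpha_within(alphas: List[float], f1_values: List[float],
                         tolerance: float) -> float:
    """Largest alpha whose F1 is at most `tolerance` points below the best F1."""
    best = max(f1_values)
    acceptable = [a for a, f in zip(alphas, f1_values) if best - f <= tolerance]
    return max(acceptable)


if __name__ == "__main__":
    for tolerance in (0.0, 1.0):
        alpha = largest_alpha_within(ALPHAS, F1, tolerance)
        trained = CLASSES_TRAINED[ALPHAS.index(alpha)]
        print(f"tolerance {tolerance}: alpha = {alpha}, classes trained = {trained}%")
```

Accepting a loss of up to one F1 point moves the choice from the best-scoring α to a larger one and lowers the share of classes trained accordingly, which is the speed-up discussed above.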



6 Conclusions and Future Work 

We have presented a new collection for multi-label indexing. The hep-ex
partition can be obtained by contacting the authors. A measure of the
imbalance degree has been proposed, along with a study of the over-weighting
of positive classes on this collection using SVM and the application of S-Cut.
The results show that this is a relevant issue, and that an imbalance study of
any multi-label collection should be carried out in order to properly select
the base binary classifiers. Another promising issue would be to work on other
aspects of imbalance like concept complexity [6]. We have started
investigating this topic by working with “concepts” rather than with terms in
order to reduce the term space. By doing this, we would cover the main
drawbacks of imbalanced collections.

Filtering by classification thresholding is very effective at reducing the
number of classes involved in multi-label classification. Without forcing
expensive tuning of the threshold, we propose to provide a range of α values
and let the algorithm choose the classifier with the best behavior.

One of the disadvantages of using the battery approach is its computational
cost, since we have to launch every classifier for a sample. However, SVM is
quite selective, not being trainable in many cases, discarding in this way many
problematic classes. This reduces the computation without losing too much in
performance. We have shown that, by increasing the selectivity, we can even gain
significantly in precision without losing too much in recall.

One multi-label collection issue we have not considered is inter-class depen- 
dency. In some preliminary analysis we found that the correlation among classes 
is relevant enough to be considered. We could actually benefit from such a cor- 
relation to speed up the classification process, by discarding those classes not 
correlated to the ones we have already found relevant. This relation could proba- 
bly be used to fight one of the drawbacks found: our recall is very low compared 
to the precision. If we were able to select those classes that are highly correlated 
with classes assigned with high precision, we might gain in recall. This will need 
further investigation. 
Abstract. In our research, we have developed a transfer-based machine
translation architecture for the translation from Japanese into German. 
One main feature of the system is the fully automatic acquisition of trans- 
fer rules from translation examples by using structural matching between 
the parsing trees. The translation system has been implemented as part 
of a language learning environment with the aim to provide personalized 
translations for the students. In this paper we present our formalism to 
represent syntactic and transfer knowledge, and explain the various steps 
involved in acquiring and applying transfer rules. 



1 Introduction 

The main aim of our research is the development of a machine translation sys- 
tem, which produces high quality translations from Japanese into German. A 
second important requirement is full customization of the system because, in 
our opinion, there exists no “perfect” translation but only a preferred one for 
a certain user. Therefore, the post-editing of a translation should result in an 
automatic update of the translation knowledge. 

Furthermore, we had to consider the two constraints that we had neither 
a large Japanese-German bilingual corpus nor resources to manually build a 
large knowledge base available. In any case, a large handcrafted knowledge base 
is in conflict with our need for flexible adaptation, and the insufficient data 
quality of today’s large corpora interferes with our demand for high quality 
translations. 

In our approach we use a transfer-based machine translation architecture (for
good overviews of this topic see [1, 2, 3]). However, we learn all the transfer
rules incrementally from translation examples provided by a user. For the
acquisition of new transfer rules we use structural matching between the
parsing trees for a Japanese-German sentence pair. To produce the input for
the acquisition component we first compute the correct segmentation and
tagging, and then transform the token lists into parsing trees. For the
translation of a sentence, the transfer component applies the transfer rules
to the Japanese parsing tree to transform it into a corresponding German
parsing tree, from which we generate a token list and surface form. In this
paper we focus on the acquisition and transfer components; for a description
of the other system modules we refer to [4].

We have developed the machine translation system as part of PETRA [5]. PETRA
is a language learning environment that assists German-speaking language
students in reading and translating Japanese documents, in particular,
educational texts. It is fully embedded into Microsoft Word so that the
students can invoke all the features from within the text editor. The
incremental improvement of the translation quality encourages a bidirectional
knowledge transfer between the student and the learning environment. Besides
the translation features, PETRA offers access to the large Japanese-German
dictionary WaDokuJT and a user-friendly interface to add new dictionary
entries. PETRA has been implemented using Amzi! Prolog, which provides full
Unicode support and an API to Visual Basic for the communication with
Microsoft Word.

The rest of the paper is organized as follows. We first introduce our formalism 
to represent the parsing trees and transfer rules in Sect. 2 and 3 before we 
describe the principal steps for the acquisition and application of the transfer 
rules in Sect. 4 and 5. 



2 Parsing Trees 



The parsing trees represent the input to the acquisition component. Instead of 
using a fixed tree structure we decided on a more flexible and robust representation. 
We model a sentence as a set of constituents (represented as a list in Prolog). 
Each constituent is a compound term of arity 1 with the constituent name as 
principal functor. Regarding the argument of a constituent we distinguish two 
different constituent types: simple constituents representing features or words, 
and complex constituents representing phrases as sets of subconstituents. 

This representation is compact because empty optional constituents are not 
stored explicitly, and is not affected by the order of the different subconstituents 
in the arguments of complex constituents. The latter is essential for a robust and 
effective application of transfer rules (see Sect. 4). Figure 1 shows an example 
of a Japanese parsing tree. The ta-form of the main verb indicates English past 
tense, expressed as perfect tense in German. 
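
As a rough illustration of this representation, the following Python sketch (not the Prolog terms used in the system) models a constituent as a (name, argument) pair, where the argument is either an atomic value or a list of subconstituents. The helper names `is_complex` and `same_constituent` are assumptions made for this example; the sample tree is the sentence of Fig. 1 in Roman transcription.

```python
# A constituent is a (name, argument) pair. Simple constituents carry an
# atomic argument (a feature or a word/category string); complex
# constituents carry a list of subconstituents that is read as a set.

def is_complex(constituent):
    """True if the argument is a list of subconstituents."""
    _, arg = constituent
    return isinstance(arg, list)


def same_constituent(c1, c2):
    """Compare two constituents, ignoring the order of subconstituents."""
    (n1, a1), (n2, a2) = c1, c2
    if n1 != n2 or is_complex(c1) != is_complex(c2):
        return False
    if not is_complex(c1):
        return a1 == a2
    if len(a1) != len(a2):
        return False
    remaining = list(a2)
    for sub in a1:
        match = next((r for r in remaining if same_constituent(sub, r)), None)
        if match is None:
            return False
        remaining.remove(match)
    return True


# The parsing tree of Fig. 1; empty optional constituents are simply absent.
sentence = [
    ("hew", "arawareru/ver"),
    ("hwf", "vta"),
    ("pav", "hajimete/adv"),
    ("adp", [("hew", "chuusei/nou"), ("php", "ninatte/par")]),
    ("sub", [("hew", "hon/nou"),
             ("anp", [("hew", "katachi/nou"),
                      ("anp", [("hew", "ima/nou")])])]),
]
```

Because subconstituents are compared as sets, an adverbial phrase written as `php` followed by `hew` counts as the same constituent as one written in the opposite order.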

For the efficient traversal, processing, and manipulation of the arguments 
of complex constituents we have implemented several generic predicates. In the 
following, we list just those predicates that are used later on in the paper: 



( , A, ): searches for subconstituent ( ) in A; fails if (_) ∉ A; 

( , A1, , A2): replaces (_) ∈ A1 with ( ), resulting in A2; if (_) ∉ A1, 
( ) is inserted as an additional subconstituent; 

( 1, 2, A1, , A2): same as the preceding predicate, except that the 
constituent name is also changed to 2; 
Das Buch in seiner heutigen Form ist im Mittelalter zum ersten Mal aufgetreten. 
The book in its present form appeared in the Middle Ages for the first time. 

[hew(arawareru/ver), 
 hwf(vta), 
 pav(hajimete/adv), 
 adp([hew(chuusei/nou), 
      php(ninatte/par)]), 
 sub([hew(hon/nou), 
      anp([hew(katachi/nou), 
           anp([hew(ima/nou)])])])] 

head word - arawareru/verb - to appear 
head word form - ta-form 
predicative adverb - hajimete/adverb - for the first time 
adverbial phrase - head word - chuusei/noun - Middle Ages 
                   phrase particle - ninatte/particle - in 
subject - head word - hon/noun - book 
          attributive noun phrase - head word - katachi/noun - form 
              attributive noun phrase - head word - ima/noun - present 

Fig. 1. Example of Japanese parsing tree (Japanese words given in Roman transcription) 



(A, A1, A2): unifies all subconstituents of A1 with the corresponding 
subconstituents in A and computes A2 = A \ A1 (used for applying transfer 
rules with shared variables for unification, see Sect. 3.2). 
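
The behaviour of these generic predicates can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the function names `find_sub`, `replace_sub` and `split_required` are hypothetical, arguments are lists of (name, argument) pairs, and plain equality stands in for Prolog unification.

```python
def find_sub(name, args):
    """Return the argument of the subconstituent called `name`; raise
    LookupError if there is none (where a Prolog predicate would fail)."""
    for sub_name, sub_arg in args:
        if sub_name == name:
            return sub_arg
    raise LookupError(name)


def replace_sub(name, args, new_arg, new_name=None):
    """Replace the subconstituent `name` by (new_name or name, new_arg);
    append it if `name` is not present. Returns a new list."""
    target = (new_name or name, new_arg)
    if any(n == name for n, _ in args):
        return [target if n == name else (n, a) for n, a in args]
    return args + [target]


def split_required(args, required):
    """Split `args` into the subconstituents listed in `required` (in the
    order of `required`) and the remaining ones. Returns None if a
    required subconstituent is missing."""
    rest = list(args)
    found = []
    for item in required:
        if item not in rest:
            return None
        rest.remove(item)
        found.append(item)
    return found, rest
```

Returning new lists instead of modifying the input keeps the sketch close to the declarative style of the original predicates.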



3 Transfer Rules 

The transfer rules are stored as Prolog facts in the rule base. We have defined several 
predicates for the different types of rules. Therefore, when we talk about rules 
in the following, we always refer to transfer rules for machine translation in the 
general sense, not to logical rules in the strict sense of Prolog. 

One characteristic of our transfer rules is that we can cover most translation 
problems with a very small number of generic abstract predicates. For some com- 
mon problems we have defined some additional specialized predicates. However, 
all of these specialized predicates could also be expressed by using our generic 
predicates. We introduce them merely to increase the compactness of our rule 
base and the efficiency of rule application. 

In the next subsections we give an overview of the different rule types along 
with illustrative examples. For the ease of the reader we use Roman transcription 
for the Japanese examples instead of the original Japanese writing. 

3.1 Rules for Translating Simple Constituents 

For simple context-insensitive translations at the word level, the argument of a 
simple constituent A1 is changed to A2 by the predicate: 

tr_asc(A1, A2). (1) 

Example 1. The default transfer rule to translate the Japanese noun HON (book) 
into the German counterpart Buch is stated as the fact: 

tr_asc(HON/nou, 'Buch'/nou). 

This general fact is then supplemented with more specific context-sensitive 
facts to handle different word meanings in other contexts. We have included the 
word category in the fact specification so that we can easily model the situation 
where a translation changes the word category. 

Example 2. The Japanese adjectival noun ONAJI (same) is translated into the 
German adjective gleich: 

tr_asc(ONAJI/ano, gleich/adj). 

More complex changes that also affect the constituent itself can be defined 
by using the predicate: 

tr_sc(C1, C2, A1, A2). (2) 

A fact for this predicate changes a simple constituent C1(A1) to C2(A2), i.e. 
constituent name and argument are replaced. 

Example 3. The Japanese attributive suffix ( ) DAKE (only) is expressed as 
attributive adverb (aav) nur in German: 

tr_sc( , aav, DAKE/ , nur/adv). 

This predicate can also be used in situations where a Japanese word is just 
expressed as a syntactic feature in German. 

Example 4. The Japanese attributive adverb MOTTOMO (most) corresponds to 
the superlative degree of comparison ( ) of an adjective in German: 

tr_sc(aav, com, MOTTOMO/adv, sup). 

The second constituent C2 is not restricted to a simple constituent; it can 
also be a complex constituent. This way we can model translations of simple 
words into whole phrases. 

Example 5. The Japanese predicative adverb HAJIMETE (for the first time) 
is translated into the German adverbial phrase zum ersten Mal: 

tr_sc(pav, adp, HAJIMETE/adv, 
      [php(zu/prp), det(def), num(sng), seq(erst/ord), hew('Mal'/nou)]). 

The adverbial phrase is specified as a set of five subconstituents: phrase 
particle zu/preposition, determiner type definite, number singular, sequence 
erst/ordinal numeral, and head word Mal/noun. Please note that zum is a contraction 
of the preposition zu and the definite article dem for the dative case. 

Finally, for modelling situations where there exist different translations for a 
Japanese word depending on the constituent name, we also provide a shortcut 
for tr_sc(C, C, A1, A2): 

tr_scn(C, A1, A2). (3) 
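
To make the interplay of these word-level rules concrete, here is a minimal Python sketch that stores some of the facts quoted above in two lookup tables. The table names, the function `translate_simple`, and the choice to try the constituent-changing rules before the default rules are assumptions for illustration only.

```python
# Word-level rule tables, filled with facts quoted in this section.
TR_ASC = {  # tr_asc(A1, A2): argument -> argument
    "HON/nou": "Buch/nou",
    "ONAJI/ano": "gleich/adj",
}
TR_SC = {  # tr_sc(C1, C2, A1, A2): (name, argument) -> (name, argument)
    ("aav", "MOTTOMO/adv"): ("com", "sup"),
    ("pav", "HAJIMETE/adv"): ("adp", [
        ("php", "zu/prp"), ("det", "def"), ("num", "sng"),
        ("seq", "erst/ord"), ("hew", "Mal/nou"),
    ]),
}


def translate_simple(name, arg):
    """Translate one simple constituent. Rules that also change the
    constituent name are tried first (an assumption of this sketch);
    words without any rule are left unchanged."""
    if (name, arg) in TR_SC:
        return TR_SC[(name, arg)]
    return name, TR_ASC.get(arg, arg)
```

For instance, `translate_simple("pav", "HAJIMETE/adv")` turns a simple constituent into a complex `adp` constituent, which mirrors the translation of a single word into a whole phrase.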
3.2 Rules for Translating Complex Constituents 

Just as in the case of simple constituents, the following predicate enables the 
substitution of arguments for complex constituents: 

tr_acc( , Req1, Req2). (4) 

Facts for this predicate change the argument of a complex constituent from 
A1 = Req1 ∪ Add to A2 = Req2 ∪ Add if hew( ) ∈ A1. The head word 
serves as index for the fast retrieval of matching facts and the reduction of the 
number of facts that have to be further analyzed. A transfer rule is applicable 
only if the set of subconstituents in Req1 is included in the input 
constituent; Req1 is then replaced by Req2. Besides Req1, the input may contain 
any additional constituents, which are transferred to the output unchanged. 
This allows for a flexible and robust realization of the transfer module (see 
Sect. 5) because one rule application changes only certain aspects of a constituent 
whereas other aspects are translated by other rules in subsequent steps. 

Example 6. The Japanese adverbial phrase CHUUSEI NINATTE (in the Middle 
Ages) is translated into the German adverbial phrase im Mittelalter: 

tr_acc(CHUUSEI/nou, [php(NINATTE/par), hew(CHUUSEI/nou)], 
       [php(in/prp), det( ), num(sng), hew('Mittelalter'/nou)]). 

Because A1 and Req1 are sets of constituents, the order of the subconstituents 
must not influence the matching of the two sets. Therefore, by applying the 
last predicate listed in Sect. 2 we retrieve each element of Req1 in A1 to create a 
list of constituents Req1s in the same order as in Req1 and then try to unify 
the two lists. As a byproduct of this sorting process we obtain the set difference 
Add = A1 \ Req1 as all remaining constituents in A1 that were not retrieved. 

Example 7. The expression JI O KAKU (literally to write characters) in the 
Japanese verb phrase KATAMEN NI JI O KAKU (to write on one side) is replaced 
by the verb beschreiben within the corresponding German verb phrase: 

tr_acc(KAKU/ver, [hew(KAKU/ver), dob([hew(JI/nou)])], [hew(beschreiben/ver)]). 

A1    = [dob([hew(JI/nou)]), hwf( ), hew(KAKU/ver), 
         adp([php(NI/par), hew(KATAMEN/nou)])] 
Req1s = [hew(KAKU/ver), dob([hew(JI/nou)])] 
Add   = [hwf( ), adp([php(NI/par), hew(KATAMEN/nou)])] 
A2    = [hew(beschreiben/ver), hwf( ), adp([php(NI/par), hew(KATAMEN/nou)])] 

The rule specifies that for any complex constituent with head word KAKU and 
direct object (dob) JI these two subconstituents are replaced by the head word 
beschreiben. In this example the input A1 for the application of this rule is a 
verb phrase with direct object JI, head word form (verb dictionary form), 
head word KAKU, and adverbial phrase KATAMEN NI. Req1s is extracted from 
A1 in the correct order, leaving Add as the list of remaining subconstituents. The 
result A2 is then formed by appending the list Add to Req2. 
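
The application of such a rule can be sketched in Python as follows. This is a simplified illustration, not the system's Prolog code: `apply_tr_acc` is a hypothetical name, constituents are (name, argument) pairs, rules contain no variables, and the value `"dictionary-form"` is a placeholder for the head word form feature.

```python
def apply_tr_acc(rule, args):
    """Apply one tr_acc-style rule (head, req1, req2) to the argument list
    of a complex constituent. Returns the new argument list, or None if
    the rule does not match. No unification: constituents must be equal."""
    head, req1, req2 = rule
    if ("hew", head) not in args:      # the head word acts as an index
        return None
    rest = list(args)
    for item in req1:                  # Req1 must be contained in A1
        if item not in rest:
            return None
        rest.remove(item)
    return list(req2) + rest           # A2 = Req2 followed by Add


rule = ("KAKU/ver",
        [("hew", "KAKU/ver"), ("dob", [("hew", "JI/nou")])],
        [("hew", "beschreiben/ver")])

a1 = [("dob", [("hew", "JI/nou")]),
      ("hwf", "dictionary-form"),      # placeholder feature value
      ("hew", "KAKU/ver"),
      ("adp", [("php", "NI/par"), ("hew", "KATAMEN/nou")])]

a2 = apply_tr_acc(rule, a1)
```

Here `a2` contains the German head word followed by the untouched head word form and adverbial phrase, which are left for later rule applications.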
The expressiveness of this formalism is increased decisively by using shared 
variables for unification within the facts. This makes it possible to change certain 
parts of subconstituents and leave other parts intact. It also allows us to define 
general rules that can be overwritten by more specific rules. 

Example 8. The following rule states that a verb phrase with head word 
TSUNAGIAWASERU and any direct object X (to put together from X) is translated 
into a verb phrase with head word zusammenfügen and the prepositional object 
consisting of the phrase particle aus (from), number plural, determiner type 
indefinite, and X: 

tr_acc(TSUNAGIAWASERU/ver, [hew(TSUNAGIAWASERU/ver), dob(X)], 
       [hew(zusammenfügen/ver), pob([php(aus/prp), det(ind), num(plu)|X])]). 

If this transfer rule is applied to the verb phrase PAPIRUSU O NANMAI MO 
TSUNAGIAWASETA (put together from several sheets of papyrus), we obtain: 

Req1  = [hew(TSUNAGIAWASERU/ver), dob(X)] 
Req2  = [hew(zusammenfügen/ver), pob([php(aus/prp), det(ind), num(plu)|X])] 
A1    = [dob([hew(PAPIRUSU/nou), ([hew(MAI/ ), php(MO/par), (NAN/ )])]), 
         hwf( ), hew(TSUNAGIAWASERU/ver)] 
Req1s = [hew(TSUNAGIAWASERU/ver), 
         dob([hew(PAPIRUSU/nou), ([hew(MAI/ ), php(MO/par), (NAN/ )])])] 
Add   = [hwf( )] 
A2    = [hew(zusammenfügen/ver), 
         pob([php(aus/prp), det(ind), num(plu), hew(PAPIRUSU/nou), 
              ([hew(MAI/ ), php(MO/par), (NAN/ )])]), 
         hwf( )] 

As can be seen, the variable X is bound to the head word PAPIRUSU and 
the complex constituent expressing a quantity. The quantity expression 
NANMAI MO (several sheets) consists of the head word MAI (counter for thin 
objects like sheets), the phrase particle MO (also), and the interrogative 
pronoun ( ) NAN (what). 
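
The effect of a shared variable can be imitated in Python with an explicit binding step. The sketch below is an assumption-laden illustration: the class `Var`, the function `match_required`, and the builder `req2` are hypothetical, the quantity subconstituent of the example is omitted for brevity, and the feature value `vta` for the ta-form is taken from Fig. 1.

```python
class Var:
    """Marker for a rule variable such as X."""
    def __init__(self, name):
        self.name = name


def match_required(req1, args):
    """Match the patterns in req1 against args (a list of (name, argument)
    pairs). A Var pattern matches any argument and is bound to it.
    Returns (bindings, rest) or None if some pattern has no match."""
    bindings, rest = {}, list(args)
    for name, pattern in req1:
        for index, (sub_name, sub_arg) in enumerate(rest):
            if sub_name == name and (isinstance(pattern, Var) or sub_arg == pattern):
                if isinstance(pattern, Var):
                    bindings[pattern.name] = sub_arg
                del rest[index]
                break
        else:
            return None
    return bindings, rest


req1 = [("hew", "TSUNAGIAWASERU/ver"), ("dob", Var("X"))]


def req2(bindings):
    # Corresponds to pob([php(aus/prp), det(ind), num(plu) | X]).
    return [("hew", "zusammenfügen/ver"),
            ("pob", [("php", "aus/prp"), ("det", "ind"), ("num", "plu")]
                    + bindings["X"])]


a1 = [("dob", [("hew", "PAPIRUSU/nou")]),
      ("hwf", "vta"),
      ("hew", "TSUNAGIAWASERU/ver")]

bindings, add = match_required(req1, a1)
a2 = req2(bindings) + add
```

The subconstituents of the direct object end up inside the prepositional object unchanged, so they can still be translated by other rules in later steps.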

One important use of this predicate is the translation of Japanese postpo- 
sitional objects into corresponding German prepositional objects because the 
choice of the German preposition depends in most cases on the main verb and 
the postposition. 

As for the case of a simple constituent, we also provide a predicate to not only 
change the argument of a complex constituent but also the constituent name: 

tr_cc(C1, C2, , Req1, Req2). (5) 

This changes a complex constituent C1(A1) to C2(A2). A1 is defined as the union 
of a set of required subconstituents Req1 and a set of optional subconstituents 
Opt: A1 = Req1 ∪ Opt. A2 is then computed as the union of the translation Req2 
of the required subconstituents and Opt: A2 = Req2 ∪ Opt. Again, the head word is used 
to speed up the rule access. Different from the unrestricted set Add in (4), Opt 
is limited to certain optional constituents, which do not change the translation 
of the rest of the phrase. 

Example 9. The adverbial phrase KATAMEN NI (on one side) introduced in 
Example 7 is translated as predicative adjectival phrase (pap) with head word 
einseitig and comparison positive. An additional attributive suffix DAKE (only) 
as in KATAMEN DAKE NI is transferred unchanged to be translated in a second 
step as shown in Example 3: 

tr_cc(adp, pap, KATAMEN/nou, [php(NI/par), hew(KATAMEN/nou)], 
      [hew(einseitig/adj), com(pos)]). 

A1    = [php(NI/par), hew(KATAMEN/nou), (DAKE/ )] 
Req1s = [php(NI/par), hew(KATAMEN/nou)] 
Opt   = [ (DAKE/ )] 
A2    = [hew(einseitig/adj), com(pos), (DAKE/ )] 

Of course, facts for predicate (5) can also contain shared variables for unification 
to enable the acquisition of general transfer rules. 

Example 10. The Japanese adjectival noun NITA together with a comparative 
phrase (cmp) X forms an attributive adjectival phrase (aap) (being similar 
to X) that is translated into a relative clause (rcl) with head word ähneln (to 
resemble) in present tense and an indirect object (iob) X: 

tr_cc(aap, rcl, NITA/ano, [hew(NITA/ano), cmp(X)], 
      [hew(ähneln/ver), ten( ), iob(X)]). 

3.3 Rules for Translating Conjunctions 

German (or English) conjunctions are expressed in Japanese mainly with the 
help of conjunctive particles. However, the translation of a conjunctive particle 
often depends on the constituent name of the complex constituent in which it is 
included, i.e. the phrase type. 

Therefore, we provide the following predicate for the definition of general 
transfer rules for situations where the argument A1 of a simple constituent is 
only translated to A2 if the constituent name of the complex constituent in which 
it is included equals the first argument: 

tr_sci( , A1, A2). (6) 

Example 11. The default transfer rule to translate the Japanese conjunctive 
particle TO (and) for combining a noun phrase with a coordinated noun phrase (cnp) 
is formulated as the fact: 

tr_sci(cnp, TO/par, und/con). 

However, when TO is used to combine a clause with a preceding clause (pcl), 
the meaning changes to the conditional sense wenn (if, when): 

tr_sci(pcl, TO/par, wenn/con). 
One particular characteristic of Japanese grammar is that there exist certain 
verb forms with conjunctional meaning to combine two clauses, e.g. the te-form. 
For this frequent situation we have defined the following predicate, which chooses 
a conjunction to insert into the preceding clause depending on the verb form and 
the head words of the two clauses: 

( , 1, 2, ). (7) 

This predicate is provided just for reasons of convenience and efficiency; it 
can also be realized with the help of predicate (4) applied to the main clause: 

tr_acc( 2, [hew( 2), pcl([hew( 1), hwf( )|X])], 
       [hew( 2), pcl([hew( 1), hwf( ), php( )|X])]). 

3.4 Rules for Translating Syntactic Features 

One of the main problems with translating Japanese into German is the large 
discrepancy between these two languages with respect to their explicitness in ex- 
pressing syntactic features at the surface level. German grammar has a complex 
system of declensions and conjugations to express number, gender, case, tense, 
mood, voice, etc. Japanese, however, is highly ambiguous regarding most of these 
features: it has no declension at all, does not distinguish between singular and 
plural, has no articles, and indicates only two tenses through conjugation. To 
bridge these two very different representations of linguistic knowledge, we have 
to find the right values for all the syntactic features required for the generation 
of the surface form of a German sentence. 

Maybe the most difficult problem is to determine the referential properties of 
a Japanese noun phrase in order to generate the correct values for the syntactic 
features number and determiner type. In our translation model we use default 
values wherever possible, which are overwritten to cover special cases. To model 
the general situation that the determiner type and the number of a constituent 
with name C depend on its head word and on the head word of the constituent 
in which it is included, we provide the predicate: 

tr_dn(C, 1, 2, , ). (8) 

In the expression HIMO O TOOSU (to thread a lace) the direct 
object is an indefinite singular noun phrase: 

tr_dn(dob, HIMO/nou, TOOSU/ver, ind, sng). 

As before, this rule is only a shortcut for writing: 

tr_acc( 2, [hew( 2), C([hew( 1)|A])], 
       [hew( 2), C([hew( 1), det( ), num( )|A])]). 

Rules of this type can again be overwritten by more specific rules. This way 
we can create a flexible hierarchy of rules reaching from the most general to the 
most specific cases. For postpositional phrases the correct number and tense are 
determined in parallel with the translation of the postposition whenever possible. 

Syntactic features regarding verbs in verb phrases, i.e. the correct tense, voice, 
mood, etc., are mainly derived from the verb form. For this purpose we provide 
a predicate to translate a verb form into a list of syntactic features: 

tr_vff( , ). (9) 

Except for the tense we use default values for the syntactic features and 
indicate only different values to keep the German parsing tree compact. As the 
information derived from the conjugation of the main verb is often ambiguous, 
in many cases the acquisition of additional, more specific rules is necessary. 



4 Acquisition of Transfer Rules 

For the automatic acquisition of new transfer rules from Japanese-German sen- 
tence pairs, we first compute both parsing trees as input to the acquisition 
component. The acquisition algorithm traverses both syntax trees in a top-down 
fashion. We start the search for new rules at the sentence level before we look for 
corresponding subconstituents to continue the search for finer-grained transfer 
rules recursively. The matching algorithm performs a complete traversal of the 
parsing trees, i.e. rules are learnt even if they are not needed for the translation 
of the Japanese sentence in order to extract as much information as possible 
from the example. 

In addition to the predicates we have already described in Sect. 2 we use 
the predicate find_opt_par( , A1, A2, 1, 2) to search for the subconstituent 
with name in A1 and A2, and retrieve the arguments of ( 1) 
and ( 2); 1 = nil if (_) ∉ A1, 2 = nil if (_) ∉ A2. 

In the following we give an overview of the steps we perform to match between 
complex constituents. Because of space limitations, we can only present the basic 
principles for the most common cases. To match between two verb phrases we: 

- derive a transfer rule of type tr_asc for the main verb; 

- derive a transfer rule of type tr_acc for subconstituents of which the translation 
  depends on the main verb, e.g. to map a postpositional object to 
  a prepositional object, and continue the matching recursively for the two 
  subconstituents without adpositions; 

- derive transfer rules of type tr_cc or tr_sc to map Japanese subconstituents 
  to different German subconstituents (e.g. a predicative adverb to an adverbial 
  phrase, an adverbial phrase to a predicative adjectival phrase, etc.); if 
  possible, matching is continued recursively for the congruent parts of the 
  two subconstituents; 

- apply find_opt_par to search for corresponding subconstituents for subject, 
  direct object, preceding clause, etc., and apply matching recursively to these 
  subconstituents; 

- derive transfer rules for conjunctions and syntactic features. 







To match between two adverbial phrases we derive a transfer rule of type 
tr_acc to translate the postposition and continue matching recursively for both 
phrases without adpositions. Finally, to match between two noun phrases we: 

- derive either a default transfer rule of type tr_asc for the head noun or a 
  transfer rule of type tr_acc for specific translations of head/modifier combinations 
  (e.g. head noun and attributive noun phrase); 

- derive transfer rules of type tr_cc or tr_sc to map Japanese subconstituents 
  to different German subconstituents (e.g. an attributive adjectival phrase to 
  a relative clause); if possible, matching is continued recursively; 

- apply find_opt_par to search for corresponding subconstituents for attributive 
  verb phrase, attributive adjectival phrase, coordinated noun phrase, etc., 
  and apply matching recursively to these subconstituents; 

- derive transfer rules for conjunctions and syntactic features. 

Each rule which is not already in the rule base is validated against the existing 
rules to resolve any conflicts resulting from adding the new rule. This resolution is 
achieved by making general rules more specific. The distinction between general 
rules for default situations and specific rules for exceptions is drawn according 
to the frequency of occurrence in the collection of sentence pairs translated in 
the past. This way we are independent from the chronological order of analyzing 
new examples, i.e. the rule acquisition is not affected if an exception is learnt 
before a general case. Figure 2 shows the rules that are learnt from the German 
translation of the Japanese sentence in Fig. 1 (the syntactic features for the 
subject correspond with the default values so that no new rule is derived). 
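
The frequency-based distinction between default rules and exceptions can be illustrated with a small Python sketch. It is a strong simplification under stated assumptions: observations are (word, context, translation) triples, the function name `derive_rules` is hypothetical, and ties are broken alphabetically, which the paper does not specify.

```python
from collections import Counter


def derive_rules(observations):
    """observations: list of (source_word, context, translation) triples.
    The most frequent translation of each word becomes its default; every
    other observed translation is kept as a context-specific exception."""
    counts = {}
    for word, _, translation in observations:
        counts.setdefault(word, Counter())[translation] += 1
    # Ties are broken alphabetically (an assumption of this sketch).
    defaults = {w: max(sorted(c), key=c.__getitem__) for w, c in counts.items()}
    exceptions = {(w, ctx): t for w, ctx, t in observations if t != defaults[w]}
    return defaults, exceptions


observations = [
    ("TO/par", "pcl", "wenn/con"),   # the exception happens to be seen first
    ("TO/par", "cnp", "und/con"),
    ("TO/par", "cnp", "und/con"),
]
defaults, exceptions = derive_rules(observations)
```

Since only the counts matter, the result is the same for any ordering of the observations, which reflects the independence from the chronological order of the examples.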

5 Application of Transfer Rules 

The transfer component traverses the Japanese syntax tree in a top-down fash- 
ion and searches for transfer rules to be applied. We always check first whether 
the conditions for more specific transfer rules are satisfied before applying more 
general rules. As explained in Sect. 3 transfer rules can also perform only partial 
translations of complex constituents, leaving some parts unchanged to be trans- 
formed later on. This flexible and robust approach requires that the transfer 
component is able to deal with parsing trees that contain mixed representations 
consisting of original Japanese parts, and German parts that were already trans- 
lated. This mixture gradually turns into a fully translated German parsing tree. 
Figure 3 shows the application of the transfer rules from Fig. 2. 

As in Sect. 4, we can only highlight the basic principles of the transfer algorithm. 
By making use of the generic predicates to manipulate complex constituents 
(see Sect. 2) we have defined the predicate (C, A1, A2) to translate 
the argument A1 of a constituent C(A1) into A2. For simple constituents this 
involves just the application of rules of type tr_asc; for complex constituents 
we perform the following principal steps: find and apply transfer rules of type 
tr_acc, transfer rules of type tr_asc for the head word, and transfer rules for 
conjunctions and syntactic features; recursively call predicates for the translation of 
all subconstituents. 
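
A much reduced Python sketch of such a top-down traversal is given below. It is an illustration only: the function `transfer`, the root constituent name `root`, and the use of the number of required subconstituents as a measure of specificity are assumptions, and only phrase-level replacement rules and word-level rules are modelled.

```python
def transfer(constituent, phrase_rules, word_rules):
    """Translate a (name, argument) constituent top-down. Phrase rules are
    (req1, req2) pairs; rules with more required subconstituents are
    tried first. Parts without a matching rule stay Japanese."""
    name, arg = constituent
    if not isinstance(arg, list):
        return name, word_rules.get(arg, arg)
    for req1, req2 in sorted(phrase_rules, key=lambda r: -len(r[0])):
        if all(item in arg for item in req1):
            arg = list(req2) + [s for s in arg if s not in req1]
            break
    return name, [transfer(sub, phrase_rules, word_rules) for sub in arg]


phrase_rules = [
    ([("php", "NINATTE/par"), ("hew", "CHUUSEI/nou")],
     [("php", "in/prp"), ("num", "sng"), ("hew", "Mittelalter/nou")]),
]
word_rules = {"ARAWARERU/ver": "auftreten/ver", "HON/nou": "Buch/nou"}

tree = ("root", [
    ("hew", "ARAWARERU/ver"),
    ("adp", [("hew", "CHUUSEI/nou"), ("php", "NINATTE/par")]),
    ("sub", [("hew", "HON/nou"), ("anp", [("hew", "KATACHI/nou")])]),
])
result = transfer(tree, phrase_rules, word_rules)
```

Because no rule covers KATACHI in this toy rule base, the result is a mixed tree with German and Japanese parts, as described above.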
[hew(arawareru/ver), hwf(vta), 
 pav(hajimete/adv), 
 adp([hew(chuusei/nou), 
      php(ninatte/par)]), 
 sub([hew(hon/nou), 
      anp([hew(katachi/nou), 
           anp([hew(ima/nou)])])])] 

[hew(auftreten/ver), ten(per), 
 adp([hew('Mal'/nou), php(zu/prp), 
      det(def), num(sng), seq(erst/ord)]), 
 adp([hew('Mittelalter'/nou), php(in/prp), 
      det(def), num(sng)]), 
 sub([hew('Buch'/nou), det(def), num(sng), 
      app([hew('Form'/nou), php(in/prp), 
           det(psv), num(sng), 
           aap([hew(heutig/adj), com(pos)])])])] 

1. Structural matching between verb phrases 
tr_asc(arawareru/ver, auftreten/ver). 
tr_sc(pav, adp, hajimete/adv, [php(zu/prp), det(def), seq(erst/ord), num(sng), hew('Mal'/nou)]). 
tr_vff(vta, [ten(per)]). 

2. Structural matching between adverbial phrases 
tr_acc(chuusei/nou, [php(ninatte/par), hew(chuusei/nou)], 
       [php(in/prp), det(ind), num(sng), hew('Mittelalter'/nou)]). 

3. Structural matching between noun phrases 
tr_asc(chuusei/nou, 'Mittelalter'/nou). 

4. Structural matching between noun phrases 
tr_asc(hon/nou, 'Buch'/nou). 
tr_cc(anp, app, katachi/nou, [hew(katachi/nou), anp([hew(ima/nou)])], 
      [php(in/prp), det(psv), num(sng), hew('Form'/nou), aap([com(pos), hew(heutig/adj)])]). 

5. Structural matching between noun phrases 
tr_asc(katachi/nou, 'Form'/nou). 
tr_cc(anp, aap, ima/nou, [hew(ima/nou)], [com(pos), hew(heutig/adj)]). 

Fig. 2. Example of rule acquisition (Japanese words given in Roman transcription) 



The predicate _ (^1, -A2) is used for finding and applying transfer rules of 

type _ ; if no transfer rule can be applied, the constituent is left unchanged: 



(A1,A2):- _ ( ,A1, ), _ ( 

(Al, 1, ), ( 2,, ,A2). 

(A, A). 



1 , 2 ), 



The recursive call for translating a subconstituent ( ) is realized with 

the predicate _ ( ,A1,A2): 



( ,A1,A2):- _ ( ,A1,, ), _ 



and the predicate 
that either: 



( , ,A1,A2). 

, Al, A2), which consists of several rules 



find and apply a rule of type _ : _ _ ( ,, ,A1,A2):- 

- ( , 2,, , 2), _ ^ 2,A1, 2,A2). 

find and apply rules of type _ ( _ is defined in a similar way to _ ) : 



- - ( , ,A1,A2):- _ ( 

- . ( , 2,A1, 2,A2). 

recursively call _ for translating : 

- ( , , 2 ), \== 2 , 

otherwise leave the constituent unchanged: 






2 ), 



( , ,A1,A2):- 

( 2,A2). 

(_,_,A,A). 
[hew(arawareru/ver), hwf(vta), pav(hajimete/adv), adp([hew(chuusei/nou), php(ninatte/par)]), 
 sub([hew(hon/nou), anp([hew(katachi/nou), anp([hew(ima/nou)])])])] 

1. tr_asc(arawareru/ver, auftreten/ver). 

[hew(auftreten/ver), hwf(vta), pav(hajimete/adv), adp([hew(chuusei/nou), php(ninatte/par)]), 
 sub([hew(hon/nou), anp([hew(katachi/nou), anp([hew(ima/nou)])])])] 

2. tr_vff(vta, [ten(per)]). 

[hew(auftreten/ver), ten(per), pav(hajimete/adv), adp([hew(chuusei/nou), php(ninatte/par)]), 
 sub([hew(hon/nou), anp([hew(katachi/nou), anp([hew(ima/nou)])])])] 

3. tr_sc(pav, adp, hajimete/adv, [php(zu/prp), det(def), seq(erst/ord), num(sng), hew('Mal'/nou)]). 

[hew(auftreten/ver), ten(per), adp([php(zu/prp), det(def), seq(erst/ord), num(sng), hew('Mal'/nou)]), 
 adp([hew(chuusei/nou), php(ninatte/par)]), 
 sub([hew(hon/nou), anp([hew(katachi/nou), anp([hew(ima/nou)])])])] 

4. tr_acc(chuusei/nou, [php(ninatte/par), hew(chuusei/nou)], [php(in/prp), det(ind), num(sng), hew('Mittelalter'/nou)]). 

[hew(auftreten/ver), ten(per), adp([php(zu/prp), det(def), seq(erst/ord), num(sng), hew('Mal'/nou)]), 
 adp([php(in/prp), det(ind), num(sng), hew('Mittelalter'/nou)]), 
 sub([hew(hon/nou), anp([hew(katachi/nou), anp([hew(ima/nou)])])])] 

5. tr_asc(hon/nou, 'Buch'/nou). 

[hew(auftreten/ver), ten(per), adp([php(zu/prp), det(def), seq(erst/ord), num(sng), hew('Mal'/nou)]), 
 adp([php(in/prp), det(ind), num(sng), hew('Mittelalter'/nou)]), 
 sub([hew('Buch'/nou), det(def), num(sng), anp([hew(katachi/nou), anp([hew(ima/nou)])])])] 

6. tr_cc(anp, app, katachi/nou, [hew(katachi/nou), anp([hew(ima/nou)])], 
         [php(in/prp), det(psv), num(sng), hew('Form'/nou), aap([com(pos), hew(heutig/adj)])]). 

[hew(auftreten/ver), ten(per), adp([php(zu/prp), det(def), seq(erst/ord), num(sng), hew('Mal'/nou)]), 
 adp([php(in/prp), det(ind), num(sng), hew('Mittelalter'/nou)]), 
 sub([hew('Buch'/nou), det(def), num(sng), app([php(in/prp), det(psv), num(sng), hew('Form'/nou), 
      aap([com(pos), hew(heutig/adj)])])])] 

Fig. 3. Example of rule applications (Japanese words given in Roman transcription) 



6 Conclusion 

In this paper we have presented a machine translation system, which automati- 
cally learns transfer rules from translation examples by using structural matching 
between parsing trees. We have completed the implementation of the system and 
are now in the process of creating a rule base of reasonable size with the assis- 
tance of several language students from our university. So far, their feedback 
regarding the usefulness of PETRA for their language studies has been very 
positive. After we have reached a certain level of linguistic coverage, future work 
will concentrate on a thorough evaluation of our system. 
Abstract. This paper compares the accuracy of several variations of the Bleu 
algorithm when applied to automatically evaluating student essays. The different 
configurations include closed-class word removal, stemming, two baseline word-sense 
disambiguation procedures, and translating the texts into a simple semantic 
representation. We also show empirically that the accuracy is maintained when the student 
answers are translated automatically. Although none of the representations clearly 
outperforms the others, some conclusions are drawn from the results. 



1 Introduction 

Computer-based evaluation of free-text answers has been studied since the sixties [1], 
and it has attracted more attention in recent years, mainly because of the popularisation 
of e-learning courses. Most of these courses currently rely only on simple kinds of 
questions, such as multiple choice, fill-in-the-blanks or yes/no questions, although it 
has been argued that this form of assessment is not accurate enough to measure the students' 
knowledge [2]. 

[3] classifies the techniques to automatically assess free-text answers into three main 
kinds: 

- Keyword analysis, which only looks for coincident keywords or n-grams. These can 
be extended with the Vector Space Model and with Latent Semantic Indexing procedures [4]. 

- Full natural-language processing, which performs a full text parsing in order to 
have information about the meaning of the student's answer. This is very hard to 
accomplish, and systems relying on this technique cannot be easily ported across 
languages. On the other hand, the availability of a complete analysis of the student's 
essay allows them to be much more powerful. For instance, E-rater [5] produces a 
complete syntactic representation of the answers, and C-rater [6] evaluates whether 
the answers contain information related to the domain concepts and generates a 
fine-grained analysis of the logical relations in the text. 

- Information Extraction (IE) techniques, which search the texts for some specific 
contents, but without doing a deep analysis. [3] describes an automatic system based on 
IE. 



* This work has been sponsored by CICYT, project number TIC2001-0685-C02-01. 
[7] provide a general overview of CAA tools. 

In previous work, we have applied Bleu [8] to evaluate student answers [9, 10], 
with surprisingly good results, considering the simplicity of the algorithm. In this pa- 
per we focus on improving the basic Bleu algorithm with different representations of 
the student’s text, by incorporating increasingly more complex syntactic and semantic 
information into our system. 

The paper is organised as follows: in Section 2 we describe the variations of the 
original algorithm. Section 3 describes how the algorithm could be ported across 
languages automatically with a very slight loss in accuracy. Section 4 explains how it could 
be integrated inside an e-learning system. Finally, conclusions are drawn in Section 5. 



2 Variations of the Scoring Algorithm 

2.1 The Original Bleu Algorithm 

Bleu [8] is a method originally conceived for evaluating and ranking Machine Translation 
systems. Using a few reference translations, it calculates an n-gram precision metric: the 
percentage of n-grams from the candidate translation that appear in any of the references. 
The procedure, in a few words, is the following: 

1. For several values of N (typically from 1 to 4), calculate the percentage of n-grams 
from the candidate translation that appear in any of the reference texts. The 
frequency of each n-gram is limited to the maximum frequency with which it appears 
in any reference. 

2. Combine the marks obtained for each value of N, as a weighted linear average. 

3. Apply a brevity factor to penalise the short candidate texts (which may have many 
n-grams in common with the references, but may be incomplete). If the candidate is 
shorter than the references, this factor is calculated as the ratio between the length of 
the candidate text and the length of the reference which has the most similar length. 

The use of several references, made by different human translators, increases the 
probability that the candidate translation has chosen the same words (and in the same 
order) as any of the references. This strength can also be considered a weakness, as this 
procedure is very sensitive to the choice of the reference translations. 

2.2 Application in e-Learning

In the case of automatic evaluation of student answers, we can consider that the students’ 
responses are the candidate translations, and the teacher can write a set of correct answers 
(with a different word choice) to be taken as references [9]. Contrary to the case of 
Machine Translation, where the automatic translation is expected to follow more or less 
rigidly the rhetorical structure of the original text, the students are free to structure their
answers as they see fit, so a lower performance of Bleu is to be expected in this case.

For evaluation purposes, we have built six different benchmark data sets from real exams,
and an additional one with definitions obtained from Google Glossary*. The seven sets,



* http://www.google.com, writing “define:” in the query. 
Table 1. Answer sets used in the evaluation. Columns indicate: set number; number of candidate
texts (NC); mean length of the candidate texts (MC); number of reference texts (NR); mean
length of the reference texts (MR); language (En, English; Sp, Spanish); question type (Def.,
definitions and descriptions; A/D, advantages and disadvantages; Y/N, Yes-No and justification of
the answer); and a short description of the question



SET  NC   MC   NR  MR   Lang  Type  Description

1    38   67   4   130  En    Def.  "Operating System" defs. from "Google Glossary"
2    79   51   3   42   Sp    Def.  Exam question about Operating Systems
3    96   44   4   30   Sp    Def.  Exam question about Operating Systems
4    143  48   7   27   Sp    A/D   Exam question about Operating Systems
5    295  56   8   55   Sp    A/D   Exam question about Operating Systems
6    117  127  5   71   Sp    Y/N   Exam question about Operating Systems
7    117  166  3   186  Sp    A/D   Exam question about Operating Systems



which include more than 1000 answers altogether, are described in Table 1. All the
answers were scored by hand by two different human judges, who also wrote the reference
texts. The instructions they received were to score each answer on a scale between 0
and a maximum score (e.g. 1 or 10), and to write two or three reference answers for
each question. We are currently transcribing three other sets corresponding to three more
questions, but we still have only a few answers for each.

We have classified the ten questions into three distinct categories:

- Definitions and descriptions, e.g. “What is an operative system?”, “Describe how
to encapsulate a class in C++”.

- Advantages and disadvantages, e.g. “Enumerate the advantages and disadvantages
of the token ring algorithm”.

- Yes/No questions, e.g. “Is RPC appropriate for a chat server? (Justify your answer)”.

All the answers were marked manually by at least two teachers, allowing for inter- 
mediate scores if the answer was only partially correct. For instance, if the maximum 
score for a given question is defined as 1.5, then teachers may mark it with intermediate 
values, such as 0, 0.25, 0.5, 0.6, etc.

The discourse structure of the answer is different for each of these kinds. Definitions
(and small descriptions) are the simplest ones. In the case of enumerations of advantages
and disadvantages of something, students can structure the answer in many ways, and
an n-gram-based procedure is not expected to identify mistakes such as citing something
which is an advantage as a disadvantage.

2.3 Modified Brevity Penalty Factor 

As discussed above, Bleu measures the n-gram precision of a candidate translation: the 
percentage of n-grams from the candidate that appear in the references. This metric is 
multiplied by a Brevity Penalty factor; otherwise, very short translations (which miss 
information) might get higher results than complete translations that are not fully
accurate. In a way, this factor is a means to include recall into the metric: if a candidate
translation has the same length as some reference, and its precision is very high, then its 
recall is also expected to be high. 
Fig. 1. Procedure for calculating the Modified Brevity Penalty factor



In contrast, ROUGE, a similar algorithm used for evaluating the output of automatic 
summarisation systems, only measures recall as the percentage of the n-grams in the 
references which appear in the candidate summary, because the purpose of a summary 
is to be maximally informative [11,12]. For extract generation, in many cases precision 
can be taken for granted, as the summary is obtained from parts of the original document. 

We argued, in previous work, that Bleu’s brevity penalty factor is not the most
adequate one for CAA [9, 10]. In the case of student answers, both recall and precision
should be measured, as it is important both that the answer contains all the required
information, and that all of it is correct.

We currently encode the recall of the answer with a modified Brevity Penalty factor, 
that we calculate using the following procedure [10]: 

1. Order the references by similarity to the candidate text.

2. For N from a maximum value (e.g. 10) down to 1, repeat: 

(a) For each N-gram from the candidate text that has not yet been found in any 
reference, 

(b) If it appears in any reference, mark the words from the N-gram as found, both 
in the candidate and the reference. 

3. For each reference text, count the number of words that are marked, and calculate 
the percentage of the reference that has been found. 

4. The Modified Brevity Penalty factor is the sum of all the percentage values. 
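A minimal sketch of this procedure is given below. Several details are not fixed by the description and are assumptions here: similarity between candidate and reference is measured as shared vocabulary size, an n-gram is skipped if any of its words is already marked, each n-gram is marked only in the first reference (in similarity order) that contains it unmarked, and percentages are expressed as fractions. The helper `find_unmarked` is a hypothetical name.

```python
def find_unmarked(ref, marked, gram):
    """Return the first position where `gram` occurs in `ref` with none of
    its words already marked, or None if there is no such position."""
    n = len(gram)
    for j in range(len(ref) - n + 1):
        if ref[j:j + n] == gram and not any(marked[j:j + n]):
            return j
    return None


def modified_brevity_penalty(candidate, references, max_n=10):
    # Step 1: order references by similarity to the candidate
    # (shared vocabulary size is an assumed similarity measure).
    cand_vocab = set(candidate)
    refs = sorted(references,
                  key=lambda ref: len(cand_vocab & set(ref)),
                  reverse=True)

    cand_found = [False] * len(candidate)
    ref_found = [[False] * len(ref) for ref in refs]

    # Step 2: go from long n-grams down to unigrams.
    for n in range(max_n, 0, -1):
        for i in range(len(candidate) - n + 1):
            # (a) only consider n-grams whose words are not yet marked
            if any(cand_found[i:i + n]):
                continue
            gram = candidate[i:i + n]
            for r, ref in enumerate(refs):
                pos = find_unmarked(ref, ref_found[r], gram)
                if pos is not None:
                    # (b) mark the words in both candidate and reference
                    for k in range(n):
                        cand_found[i + k] = True
                        ref_found[r][pos + k] = True
                    break

    # Steps 3 and 4: fraction of each reference that was found, summed.
    return sum(sum(found) / len(found) for found in ref_found if found)
```

Processing long n-grams first means that a long shared phrase is credited to a single reference before its individual words can be scattered across several references.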

Figure 1 describes how the factor is calculated. The results obtained with this modified
Brevity Penalty factor are better, and the improvement is statistically significant at the
0.95 confidence level. Surprisingly for us, the best result using this modified penalty
factor was obtained just for unigrams.
In automatic summarisation evaluations, unigrams have been found also to work better 
than n-grams in some cases [11,12]. 

2.4 Extensions Proposed 

There are a number of simple modifications to the original algorithm: 

1. Stemming: to be able to match nouns and verbs inflected in different ways, e.g. “to
manage” and “managing”.

2. Removal of Closed-Class Words. These are usually important for finding matching
N-grams for long values of N; however, in the case of unigrams, they are probably
present in every kind of text and are not very informative. Given that the best
correlation was obtained just for unigrams, these are not that important.

3. Word-Sense Disambiguation. If we were able to identify the sense intended by
both the teacher and the student, then the evaluation would be more accurate. We
do not have any Word-Sense Disambiguation procedure available yet, so we have
tried the following baseline methods:

- For English, we have used the Semcor corpus [13] to find the most popular word
sense for each of the words in WordNet, which is the sense we always take. In
this case, we substitute every word w_i in the candidate and the references by
the identifier of the synset such that w_i is tagged with that identifier in Semcor
more times than with any other.

- For Spanish, as we do not have a corpus with semantic annotations, we always
choose, for every word, the first sense in the Spanish WordNet database [14].

- For both languages, we have also tried substituting each word by the list of
the identifiers of all the synsets that contain it. In this case, we consider
that two n-grams match if their intersection is not empty.

Figure 2 shows how the input text is modified before it is sent to the modified Bleu
algorithm. In all of these cases, the unigram co-occurrence metric is calculated after
this processing. The algorithm itself is only modified in the last case, in which
the procedure to check whether two unigrams match is not a string equality test, but a
test that the set intersection is not empty.
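The "all synsets" configuration can be illustrated with a short sketch. The closed-class list and the sense inventory below are tiny hypothetical stand-ins (the identifiers are invented, not WordNet offsets), and lowercasing is an added assumption; a real system would look words up in WordNet.

```python
# Hypothetical, deliberately tiny resources for illustration only.
CLOSED_CLASS = {"of", "that", "the", "and", "an", "a", "its", "it", "to", "for"}
SYNSETS = {
    "car": {"syn-001", "syn-002"},
    "automobile": {"syn-001"},
    "train": {"syn-003"},
}


def remove_closed_class(tokens):
    """Drop closed-class words before matching."""
    return [t for t in tokens if t.lower() not in CLOSED_CLASS]


def to_synset_sets(tokens):
    """Replace each word by the set of identifiers of all synsets that
    contain it; unknown words are kept as a one-element set."""
    return [frozenset(SYNSETS.get(t.lower(), {t.lower()})) for t in tokens]


def unigrams_match(a, b):
    """Two unigrams match if their synset sets share an identifier."""
    return bool(a & b)


def matched_fraction(candidate_sets, reference_sets):
    """Fraction of candidate unigrams that match some reference unigram."""
    if not candidate_sets:
        return 0.0
    hits = sum(any(unigrams_match(c, r) for r in reference_sets)
               for c in candidate_sets)
    return hits / len(candidate_sets)
```

With these toy entries, "car" and "automobile" match because they share one identifier, which is exactly the string-equality test being replaced by a set-intersection test.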

2.5 Representing Syntactic Dependences 

In order to extend the system with information about the syntactic dependences between
the words in the texts, we have tried an extended version of the system in which the
references and the candidate answer are analysed with a parser, and the dependences
between the words are then extracted. The library we have used for parsing is the wraetlic
tools [15]^.

Figure 3 shows the dependences obtained from the candidate text from Figure 2. 
This can be taken as a first step in obtaining a logical representation of the text, but there 
are currently some limitations of our parser which do not allow us to produce a more
reliable semantic analysis: it does not currently support prepositional-phrase attachment 
or coreference resolution. 

2.6 Analysis and Discussion 

Table 2 shows, for each data set and configuration of the algorithm, the results measured
as the correlation between the teacher’s scores and the scores produced automatically.
Correlation is a metric widely used for evaluating automatic scoring systems [16, 6,
17, 7]. Given the array of scores X assigned by the teacher, and the array of scores Y
assigned automatically, the correlation is defined as



^ Available at www.ii.uam.es/~ealfon/eng/download.html
Original: Collection of programs that supervises the execution of other programs and the manage-
ment of computer resources. An operating system provides an orderly input/output environment 
between the computer and its peripheral devices. It enables user-written programs to execute 
safely. An operating system standardizes the use of computer resources for the programs running 
under it. 

Stemmed: [Collection, of, program, that, supervise, the, execution, of, other, program, and, the, 
management, of, computer, resource. An, operating, system, provide, an, orderly, input, envi- 
ronment, between, the, computer, and, its, peripheral, device. It, enable, user-written, program, 
to, execute, safely. An, operating, system, standardize, the, use, of, computer, resource, for, the, 
program, run, under, it] 



Without closed-class words: [Collection, programs, supervises, execution, other, programs, 
management, computer, resources, operating, system, provides, orderly, input/output, environ- 
ment, computer, peripheral, devices, enables, user-written, programs, execute, safely, operating, 
system, standardizes, use, computer, resources, programs, running] 



Stemmed, no closed-class words: [Collection, program, supervise, execution, other, program, 
management, computer, resource, operating, system, provide, orderly, input, environment, com- 
puter, peripheral, device, enable, user-written, program, execute, safely, operating, system, stan- 
dardize, use, computer, resource, program, run] 



Most-common synset: [n06496793, of, n04952505, that, v01821686, the, n00842332, of,
a02009316, n04952505, and, the, n00822479, of, n02625941, n11022817, An, operating,
n03740670, v01736543, an, a01621495, n05924653, n11511873, between, the, n02625941, and,
its, a00326901, n02712917. It, v00383376, user-written, n04952505, to, v01909959, r00152042.
An, operating, n03740670, v00350806, the, n00682897, of, n02625941, n11022817, for, the,
n04952505, v01433239, under, it]



Most-common synset, no closed-class words: [n06496793, n04952505, v01821686,
n00842332, a02009316, n04952505, n00822479, n02625941, n11022817, operating,
n03740670, v01736543, a01621495, n05924653, n11511873, n02625941, a00326901,
n02712917, v00383376, user-written, n04952505, v01909959, r00152042, operating,
n03740670, v00350806, n00682897, n02625941, n11022817, n04952505, v01433239]



All synsets: [[Collection], [of], [n00391804, n04952505, n04952916, n05335777, n05390435,
n05427914, n05472858, n05528119], [that], [v01615271, v01821686], [the], [n00068488,
n00817656, n00842332, n11140581], [of], [a02009316], [n00391804, n04952505, n04952916,
n05335777, n05390435, n05427914, n05472858, n05528119], [and], [the], [n00822479,
n06765853], [of], [n02625941, n07941303], [n04334536, n04749592, n11022817], ...]



All synsets, no closed-class words: [[Collection], [n00391804, n04952505, n04952916,
n05335777, n05390435, n05427914, n05472858, n05528119], [v01615271, v01821686],
[n00068488, n00817656, n00842332, n11140581], [a02009316], [n00391804, n04952505,
n04952916, n05335777, n05390435, n05427914, n05472858, n05528119], [n00822479,
n06765853], [n02625941, n07941303], [n04334536, n04749592, n11022817], ...]



Fig. 2. Modification of a candidate answer depending on the configuration of the automatic scorer. 
The synset identifiers in the last four cases are taken from WordNet 1.7 
supervise([Collection], [execution, management]);
provide([Operating_System], [environment, devices]);
enable([It], [programs]);
execute([], []);
standardize([Operating_System], [use]);
run([resources], []);
user-written(program);
computer(resource);
of(programs);
of(resources);
for(programs);
under(it);



Fig. 3. Dependences obtained from the syntactic representation of the candidate answer 



Table 2. Correlation between Bleu and the manual scores (left column), and correlations for the 
modified algorithm in different configurations (those from Figure 2) and, finally, the results using 
the syntactic dependences 



Set Bleu   Basic  stem   cc     stem-cc WSD    WSD-cc All     All-cc Deps.

1   0.5886 0.5859 0.6189 0.5404 0.5821  0.6322 0.5952 0.2076  0.1125 0.3243
2   0.3609 0.5244 0.4832 0.5754 0.4797  0.4706 0.4655 0.1983  0.2968 0.2404
3   0.3694 0.3210 0.2364 0.3234 0.2917  0.2211 0.2844 0.1107  0.1431 0.1560
4   0.4159 0.6608 0.6590 0.6811 0.7000  0.6634 0.6933 0.6349  0.6702 0.4139
5   0.0209 0.1979 0.2410 0.2437 0.3013  0.2434 0.3040 -0.0201 0.0450 0.1884
6   0.2102 0.4027 0.3977 0.4159 0.4046  0.3822 0.3838 0.2607  0.3297 0.1302
7   0.4172 0.3970 0.4634 0.4326 0.4910  0.4727 0.5261 0.2880  0.3337 0.1726



\[
\mathrm{correlation}(X, Y) = \frac{\mathrm{covariance}(X, Y)}{\mathrm{standardDev}(X) \times \mathrm{standardDev}(Y)}
\]
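As an illustration of the correlation used to compare manual and automatic scores, a direct computation might look as follows. This is a sketch using population (divide-by-n) covariance and standard deviations, which is an assumption; the choice cancels out in the ratio.

```python
import math


def correlation(xs, ys):
    """Correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / (sd_x * sd_y)
```

Here `xs` would hold the teacher's scores and `ys` the automatic scores for the same answers, in the same order.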



Some observations can be drawn from these data: 

- No configuration clearly outperforms the others.

- The removal of closed-class words improves the results, although the improvement is
not statistically significant.

- The rather simple Word-Sense Disambiguation procedure that we have used has 
attained the best results in three cases, and does not produce much loss for the re- 
maining questions. We can take this as an indication that if we had a better algorithm 
these results might be better than the other configurations. 

- The technique of choosing all the synsets containing a given word (with or without 
closed-class words) was clearly poorer than all the others. 

- The metric obtained by looking for coincidences of the dependences between words in
the candidate answer and in the references was also inferior to the other configurations.
This may be due to the fact that the dependences are only collected for some
words because the parses are incomplete, and much information is lost using this
representation.

- The correlation values obtained, although they are not high enough to be used in
practical applications, are better than those obtained with other keyword-based
evaluation metrics used in existing systems in combination with other techniques.
Therefore, we believe that this procedure is well suited to replace other keyword-based
scoring modules in more complex evaluation environments (e.g. [5]).

Table 3. Correlation between Bleu and the manual scores (left column), and correlations for the
modified algorithm in different configurations (those from Figure 2), using an automatic Machine
Translation system

Set Bleu   Basic  stem   cc     stem-cc WSD    WSD-cc All     All-cc Deps.

1   0.5886 0.6174 0.6007 0.5663 0.5702  0.6194 0.5919 0.1519  0.0516 0.4081
2   0.3609 0.5330 0.4337 0.5479 0.5310  0.4176 0.4841 0.2276  0.2068 0.2501
3   0.3694 0.1660 0.1736 0.2892 0.2814  0.1734 0.3264 0.0789  0.2035 0.1210
4   0.4159 0.5937 0.6899 0.6066 0.7567  0.6998 0.7655 0.6008  0.6216 0.3897
5   0.0209 0.2449 0.2426 0.3213 0.3459  0.2358 0.3282 -0.0102 0.0220 0.0674
6   0.2102 0.3649 0.3326 0.3450 0.3754  0.3150 0.3586 0.1716  0.3070 0.1607
7   0.4172 0.4583 0.4635 0.4515 0.4850  0.4510 0.4699 0.2859  0.3452 0.1691

3 Multilingual Evaluation 

In order to check whether the evaluation system can be automatically ported across
languages without the need to rewrite the reference texts, we have performed the
following experiment: for each of the questions, we have translated the references manually
into the other language (from Spanish into English and vice versa) and we have translated
the candidate texts using Altavista Babelfish^. Table 3 shows the results obtained.

In a real case of an e-learning application, the authors of a course would simply need 
to write the reference texts in their own language (e.g. Spanish). An English student 
would see the question translated automatically into English, would write the answer, 
and next the system would automatically translate it into Spanish and score it against the 
teacher’s references. As can be seen from the results, the loss of accuracy is small; for 
some of the questions, the correlation in the best configuration even increases. Again, 
the removal of closed-class words seems to give better results, and there are two cases 
in which Word-Sense Disambiguation is useful. 

4 Application on an On-line e-Learning System 

Ideally, a student should be able to submit the answer to a question and receive
his or her score automatically. This system is not intended to replace a teacher, but
it might help students in their self-study. We have built an on-line system for open-ended
questions called Atenea.



^ Available at http://world.altavista.com/ 
Fig. 4. Regression line between the teacher’s scores and the system’s scores



Given a student answer, Bleu provides a numerical score. This score needs to be
put in the teacher’s scale so the student can understand its meaning. For instance,
Figure 4 shows the regression line for question 3. It can be seen that Bleu’s scores
(Y axis) are between 0 and 0.7, while the teacher marked all the answers between
0 and 1 (X axis). It is also interesting to notice that most of the teacher’s scores
are either 0 or 1: only a few answers received intermediate marks. In this example,
if a student receives a Bleu score of 0.65, he or she would think that the answer
was not very good, while the regression line indicates that that score is one of the
best.

Therefore, it is necessary to translate Bleu’s scores to the scale used by the teacher 
(e.g. between 0 and 1). 

We propose the following methods: 

- Evidently, if we have a set of student answers marked by a teacher, then we can 
calculate the regression line (as we have done in Figure 4) and use it. Given Bleu’s 
score, we can calculate the equivalent in the teacher’s score automatically. The 
regression line minimises the mean quadratic error. 

- In some cases it may not be possible to have a set of answers manually scored. In this
case, we cannot calculate the regression line, but we can estimate it in the following
way:

We take the student answers for which Bleu’s scores, b_1 and b_2, are minimal
and maximal, and we assume that their corresponding scores in the teacher’s scale
are 0 and 1 (i.e. they are the worst and the best possible answers). The estimated
regression line will be the line that passes through the two points (0, b_1) and (1, b_2).
This is only an approximation, and it has the unwanted feature that if a student produces
an answer that scores best, then the remaining students will see their scores lowered
automatically, as the line will change.
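Both ways of mapping a Bleu score onto the teacher's scale can be sketched as below. The function names are hypothetical, and fitting the teacher's score directly as a linear function of the Bleu score (rather than the other way round, as the axes of Figure 4 might suggest) is an assumption made so that the line can be applied to new Bleu scores without inverting it.

```python
def fit_regression(bleu_scores, teacher_scores):
    """Least-squares line teacher = slope * bleu + intercept, for the
    case where manually scored answers are available."""
    n = len(bleu_scores)
    mean_b = sum(bleu_scores) / n
    mean_t = sum(teacher_scores) / n
    var_b = sum((b - mean_b) ** 2 for b in bleu_scores)
    if var_b == 0:
        raise ValueError("all Bleu scores are identical")
    slope = sum((b - mean_b) * (t - mean_t)
                for b, t in zip(bleu_scores, teacher_scores)) / var_b
    return slope, mean_t - slope * mean_b


def fit_two_points(bleu_scores, max_teacher=1.0):
    """Unsupervised estimate: the lowest Bleu score (b_1) is mapped to 0
    and the highest (b_2) to the maximum teacher score."""
    b1, b2 = min(bleu_scores), max(bleu_scores)
    if b1 == b2:
        raise ValueError("need at least two distinct Bleu scores")
    slope = max_teacher / (b2 - b1)
    return slope, -slope * b1


def to_teacher_scale(bleu, line):
    """Apply a (slope, intercept) line to a single Bleu score."""
    slope, intercept = line
    return slope * bleu + intercept


def mean_quadratic_error(line, bleu_scores, teacher_scores):
    """Mean squared difference between rescaled and manual scores."""
    errors = [(to_teacher_scale(b, line) - t) ** 2
              for b, t in zip(bleu_scores, teacher_scores)]
    return sum(errors) / len(errors)
```

The drawback mentioned above is visible in `fit_two_points`: adding one new answer with a higher Bleu score changes `b2`, and therefore the rescaled score of every other answer.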

Table 4 shows the mean quadratic errors produced by the regression line and by the
line estimated in this unsupervised way.
Table 4. Mean quadratic error for the several regression lines

Set  Regression  Two-points

1    0.81        6.33
2    8.29        15.03
3    6.78        8.50
4    17.41       22.73
5    25.77       48.08
6    15.59       16.13
7    5.10        29.63



5 Conclusions and Future Work 

In previous work [9, 10] we described a novel application of the Bleu algorithm for 
evaluating student answers. We compare here several variations of the algorithm which 
incorporate different levels of linguistic processing of the texts: stemming, a tentative 
word-sense disambiguation and a shallow dependency-based representation obtained di- 
rectly from a syntactic analysis. The results show that in nearly every case some of these 
modifications improve the original algorithm. Although no configuration clearly outper- 
forms all the others, we can see that closed-class words removal is usually useful, and that 
improving the word-sense disambiguation module seems a very promising line to follow, 
given that a baseline procedure for WSD has been found effective for some datasets. 

We also describe a feasible way in which this system might be integrated inside
e-learning systems, with little effort on behalf of the course authors, who would only
be required to write a few correct answers for each of the questions. Although a set of
manually corrected student answers might be desirable to minimise the mean quadratic
error, there are workarounds that avoid that work. The n-grams shared between the
students’ answers and the references can be shown to the students so they can see which
parts of their answers have improved their score. Finally, the characteristics of the
algorithm make it very natural to integrate in adaptive courses, in which the contents and
tasks that the students must complete depend on their profiles or actions. Just by providing
a different set of reference answers (e.g. in another language, or for a different subject
level), the same question can be evaluated in a suitable way depending on the student model.
Furthermore, we have seen that it can be automatically ported across languages using a
state-of-the-art Machine Translation system with little or no loss in accuracy.

Future work includes the following lines:

- Improving the word-sense disambiguation module, and integrating it with a logical
formalism of representation, so the predicates and their arguments are not words but
synset identifiers.

- Studying better models for estimating the regression line when the answers corrected
by the teacher are not available.

- Extending the algorithm so that it is capable of discovering the internal structure of
the answer. This would be desirable, for instance, when evaluating enumerations
of advantages or disadvantages, where it is necessary to discover if the student is
referring to one or the other.







- Exploring the multilingual evaluation, to discover why the correlation
increases in some cases. A possible reason may be that the automatic translations
employ a more reduced vocabulary.

- Performing a full integration of Atenea with the web-based adaptive e-learning system
TANGOW [18], which has also been developed at our home university.
Abstract. In this paper we propose the use of an HMM-based phonetic
aligner together with a speech-synthesis-based one to improve the
accuracy of the global alignment system. We also present a phone
duration-independent measure to evaluate the accuracy of the automatic
annotation tools. In the second part of the paper we propose and evaluate
some new confidence measures for phonetic annotation.



1 Introduction 

The flourishing number of spoken language repositories has pushed speech research
in multiple ways. Many of the best speech recognition systems rely on
models created with very large speech databases. Research into natural prosody
generation for speech synthesis is, nowadays, another important issue that uses
large amounts of speech data. These repositories have allowed the development
of many corpus-based speech synthesizers in recent years, but they need
to be phonetically annotated with a high level of precision. However, manual
phonetic annotation is a very time-consuming task, and several approaches have
been taken to automate this process. Although state-of-the-art segmentation
tools can achieve very accurate results, there are always some uncommon acoustic
realizations or some kind of noise that can badly damage the segmentation
performance for a particular file. With the increasing size of speech databases,
manual verification of every utterance is becoming unfeasible; thus, some
confidence scores must be computed to detect possible bad segmentations within
each utterance. The goal of this work is the development of a robust phonetic
annotation system, with the best possible accuracy, and the development
and evaluation of confidence measures for the phonetic annotation process. This
paper is divided into four sections. Section 2 describes the development of the
proposed phonetic aligner. Section 3 describes and evaluates the proposed
confidence measures, and the conclusions are given in the last section.
2 Automatic Segmentation Approaches

Automatic phonetic annotation is composed of two major steps: the determination
of the utterance phone sequence (the sequence produced by the speaker
during the recording procedure) and the temporal location of the segment boundaries
(phonetic alignment). Several phonetic alignment methods have been proposed,
but the most widely explored techniques are based either on Hidden
Markov Models (HMM) used in forced alignment mode [1] or on dynamic time
alignment with synthesized speech [2]. The main reasons for the superiority of these
two techniques are their robustness and accuracy, respectively. An HMM-based aligner
consists of a finite state machine that has a set of state occupancy probabilities
at each time instant and a set of inter-state transition probabilities. These
probabilities are computed using some manually or automatically segmented
data (training data). On the other hand, the speech-synthesis-based aligners are
based on a technique used in the early days of speech recognition. A synthetic
speech signal is generated with the expected phonetic sequence, together
with the segment boundaries. Then, some spectral features are computed from
the recorded and the generated speech signals, and finally the Dynamic Time
Warping (DTW) algorithm [3] is applied to compute the alignment path between
the signals that gives the best match between the spectral features. The
reference signal segment boundaries are mapped onto the recorded signal using
this alignment path. A comparison between the results of HMM-based and
speech-synthesis-based segmentation [4] has shown that in general (about 70%
of the time) the speech-synthesis-based segmentation is more accurate than the
HMM-based one; however, it tends to generate a few large boundary errors (when
it fails, it fails badly). This means that the HMM-based phonetic aligners are
more reliable.
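The boundary-mapping idea behind the speech-synthesis-based aligner can be sketched with a textbook DTW. This is a generic illustration, not the aligner described here: the step pattern (diagonal, vertical, horizontal with equal weights), the frame-level distance function passed in by the caller, and the rule of mapping a boundary to the first recorded frame aligned with it are all assumptions.

```python
def dtw_path(ref_feats, rec_feats, dist):
    """Return the minimum-cost alignment path between the synthetic
    (reference) and recorded feature sequences as (ref, rec) frame pairs."""
    n, m = len(ref_feats), len(rec_feats)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(ref_feats[i - 1], rec_feats[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # both advance
                                 cost[i - 1][j],      # reference advances
                                 cost[i][j - 1])      # recording advances
    # Backtrack from the end of both signals to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def map_boundaries(path, ref_boundaries):
    """Map each reference boundary (a frame index) to the first recorded
    frame aligned with it on the path."""
    first_match = {}
    for ref_frame, rec_frame in path:
        first_match.setdefault(ref_frame, rec_frame)
    return [first_match[b] for b in ref_boundaries]
```

In practice each frame would be a vector of spectral features and `dist` a vector distance; scalar frames with an absolute difference are enough to try the functions out.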

The lack of robustness of the speech-synthesis-based aligners, together with their
better boundary location accuracy, suggested the development of a hybrid system:
a system as accurate as the speech-synthesis-based aligner and with the
robustness of the HMM-based aligners.

2.1 Speech Synthesis Based Phonetic Aligners 

The first conclusion drawn from the usage of some commonly used
speech-synthesis-based aligners is that the acoustic features do not prove to be equally
good for locating the boundaries for every kind of phonetic segment. For instance,
although the energy is, in general, a good feature to locate the boundary between
a vowel and a stop consonant, it performs poorly in locating the boundary
between two vowels. Thus, some experiments were performed with multiple acoustic
features and multiple segment transitions to find the best acoustic features
to locate the boundaries between each different pair of phonetic segments. This
acoustic feature selection considerably increased the robustness of the resulting
aligner. The reference speech signal was generated using the Festival Speech
Synthesis System [5] using a Portuguese voice recorded at our lab. A detailed
description of this work can be found in [6].
2.2 HMM Based Phonetic Aligners

Once the speech-synthesis-based aligner had been built with good enough robustness,
it was used to generate the training data for the HMM-based aligner. Given
the amount of available training data, context-independent models were chosen
for the task. Figure 1 shows the different phone topologies. The upper one is
used for all phonetic segments except the silence, semi-vowels and schwa. The central
topology is used to represent segments with short durations, like the semi-vowels
and schwa, by allowing a skip between the first and last states. The silence
model is the lower one. In this case there is a transition from the first state to the
last as well as another one from the last to the first state; this can
be used to model very large variations in the duration of the silences in the
speech database. Each model state consists of a set of eight Gaussian mixtures.
The adopted features were the Mel-Frequency Cepstrum Coefficients, their first
and second order differences, and the energy and its first and second differences.
Frames are spaced 5 milliseconds apart, with a 20-millisecond long window. The
training of the models was performed using the HTK toolkit.







Fig. 1. Three HMM topologies were used for the different kinds of phonetic segments.
The upper one is the general model, the central topology is used for semi-vowels and
schwa, and the last one for the silence

2.3 Segment Boundary Refinement 

As expected, using the HMM-based aligner, a more robust segmentation was 
obtained. The next step was to use our speech-synthesis-based aligner to refine 
the location of the segment boundaries. 

2.4 Segmentation Results 

Two of the most common approaches to evaluating segmentation accuracy
are to compute the percentage of phonetic segments whose boundary location
errors are below a given tolerance (often 20 ms), or to compute the root mean square
error of the boundary locations. Although these can be good predictors of an aligner’s
accuracy, it is clear that an error of about 20 ms in a 25-ms long segment is much
more serious than the same error in a 150-ms long segment. In the first case
the segment frames are almost always badly assigned. For this reason, a phone-based
duration-independent measure is proposed to evaluate the aligners’ accuracy:
the percentage of well-assigned frames within the segment.
We will call it the Overlap Rate (OvR). Fig. 2 illustrates the computation of
this measure. Given a segment, a reference segmentation (RefSeg), and the
segmentation to be evaluated (AutoSeg), OvR is the ratio between the number of
frames that belong to that segment in both segmentations (Common_Dur
in Fig. 2) and the number of frames that belong to the segment in at least one
segmentation (Dur_max in Fig. 2). The following equation gives the
computation of OvR:

$$\mathrm{OvR} = \frac{Common\_Dur}{Dur\_max} = \frac{Common\_Dur}{Dur\_ref + Dur\_auto - Common\_Dur} \qquad (1)$$



[Figure 2: timelines of RefSeg and AutoSeg for a segment (seg=A), showing Dur_ref, Dur_auto, Common_Dur and Dur_max]



Fig. 2. Graphical representation of the quantities involved in the computation of the 
Overlap Rate 



From equation 1, one can see that if, for example, the duration of a phone
in the reference segmentation differs considerably from its duration in
the other segmentation, OvR takes a very small value. Let X be
Dur_ref, Y be Dur_auto and z be Common_Dur of Fig. 2, and suppose
X < Y; then:

$$0 \le \mathrm{OvR} = \frac{z}{X + Y - z} \le \frac{X}{Y} \qquad (2)$$



since the number of common frames (z) is at most the smaller of the two
frame counts for the given segment. One can thus conclude that this measure
is duration independent and produces a more reliable evaluation of
annotation accuracy.
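The computation of OvR can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that each annotation of a segment is given as a pair of frame indices (start inclusive, end exclusive), and the function name `overlap_rate` is hypothetical.

```python
def overlap_rate(ref, auto):
    """Overlap Rate (OvR) between two annotations of the same segment.

    ref and auto are (start_frame, end_frame) pairs, end exclusive
    (an assumed representation, not one given in the text).
    """
    ref_start, ref_end = ref
    auto_start, auto_end = auto
    dur_ref = ref_end - ref_start
    dur_auto = auto_end - auto_start
    # Frames assigned to the segment by both segmentations (Common_Dur).
    common = max(0, min(ref_end, auto_end) - max(ref_start, auto_start))
    # Frames assigned to the segment by at least one segmentation (Dur_max).
    dur_max = dur_ref + dur_auto - common
    return common / dur_max if dur_max > 0 else 0.0


# The same 20 ms shift (4 frames of 5 ms) on a 25 ms and a 150 ms segment.
short_segment = overlap_rate((0, 5), (4, 9))
long_segment = overlap_rate((0, 30), (4, 34))
print(round(short_segment, 3), round(long_segment, 3))
```

The short segment gets a much lower score than the long one for the same boundary shift, which is the behavior the measure is designed to capture.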

Figure 3 shows the accuracy of the three annotation tools developed. The
x-axis is the percentage of incorrectly assigned frames ((1 - OvR) · 100%) and the
y-axis is the percentage of phones whose percentage of incorrectly assigned
frames is lower than the value given on the x-axis. The solid line represents the
accuracy of the HMM-based aligner, and the dashed line is the accuracy of the
speech-synthesis-based aligner when it is used to refine the results of the
HMM-based aligner. The dotted line represents the accuracy of the
speech-synthesis-based aligner when no other alignments were available. In fact,
these results are not a fair comparison among the annotation tools, because the
HMM-based aligner is adapted to the speaker, while the speech-synthesis-based
aligners are not. Nevertheless, the phone models used in the HMM-based aligner
were trained on data aligned by the speech-synthesis-based aligner. These
results also suggest that combining HMM-based and speech-synthesis-based
annotation tools is worthwhile, as the former is more robust and the latter
is more accurate.

Fig. 3. Annotation accuracy for the three tested annotation techniques



3 Confidence Scores 

In this section we propose some phone-based confidence scores for detecting 
misalignments in the utterance. The goal is to locate regions of the speech signal 
where the alignment method may have failed and that could benefit from human 
intervention. 

3.1 The Chosen Features 

The alignment process provides a set of features that can be used as indicators 
of annotation mismatch. This set of features is described below. 

— DTW mean distance: mean distance between the features of the recorded 
signal frames and the synthesized speech signal over the alignment path for 
a given phone; 

— DTW variance: variance of the mean distance between the features of the 
recorded signal frames and the synthesized speech signal over the alignment 
path for a given phone; 

— DTW minimal distance: minimal distance between the features of the
recorded signal frames and the synthesized speech signal over the alignment
path for a given phone;

— DTW maximal distance: maximal distance between the features of the
recorded signal frames and the synthesized speech signal over the alignment
path for a given phone;

— HMM mean distance: mean distance between the features of the recorded 
signal frames and the phone model; 

— HMM variance: variance of the distance between the features of the recorded 
signal frames and the phone model; 

— HMM minimal distance: minimal distance between the features of the
recorded signal frames and the phone model;

— HMM maximal distance: maximal distance between the features of the
recorded signal frames and the phone model.

Each segment of the database is associated with a vector of features that will 
be used to predict a confidence score for the alignment of that phone. To provide 
some context we decided to include not only the feature vector of the current 
phone but also the feature vectors of the previous and following segments. 
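As an illustration of how such context can be added, the sketch below concatenates each segment's feature vector with those of its neighbours. The text does not say how the first and last segments of an utterance are handled; repeating the edge vector is an assumption made here, and the name `add_context` is hypothetical.

```python
def add_context(features):
    """Concatenate previous, current and next feature vectors per segment.

    features: list of per-segment feature lists, in utterance order.
    Edge segments reuse their own vector as the missing neighbour
    (an assumption; the source does not specify edge handling).
    """
    if not features:
        return []
    padded = [features[0]] + list(features) + [features[-1]]
    return [padded[i - 1] + padded[i] + padded[i + 1]
            for i in range(1, len(padded) - 1)]


# Three segments with a single (toy) feature each.
print(add_context([[1.0], [2.0], [3.0]]))
```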

We were then in a position to evaluate the reliability of
the different techniques that we propose for detecting annotation problems.

Three different approaches will be evaluated: Classification and Regression 
Trees (CART), Artificial Neural Networks (ANN) and Hidden Markov Models 
(HMM). 

3.2 Definition of Bad Alignment 

A boundary between good and bad alignment is hard to define. Some researchers
assume that boundary errors larger than 10 milliseconds must be considered
misalignments, while others are more tolerant. As we explained before, the effect
of an error in the location of the boundaries may differ from segment to
segment, depending on its duration. Thus, we will use the duration-independent
measure proposed before to compute the accuracy of annotation tools: we will
assume that a misalignment occurs when OvR < 0.75.

3.3 Classification and Regression Trees 

To train a regression tree we used the CART-building program that is part of
the Edinburgh Speech Tools [7]. This program can build both classification
and regression trees; in this problem it was used as a regression tool to predict
the value of OvR from the features described above. We used a training set of
28000 segments and a test set of 10000 segments.

Since each leaf of the tree stores the average value of OvR and its variance,
and assuming a Gaussian distribution in the leaves, we can compute the
probability of OvR being lower than the threshold defined in the previous
subsection. Let μ and σ be the average value of OvR and its standard deviation,
respectively, in a given leaf of the tree. Then the probability of misalignment is
given by:

$$P(\mathrm{OvR} < 0.75 \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{0}^{0.75} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx \qquad (3)$$

We then had to apply a threshold to the resulting probability. By varying
this threshold we obtained a Precision/Recall curve, represented as a dotted
line in Fig. 4.
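The integral in equation 3 has a closed form in terms of the Gaussian cumulative distribution function. The following sketch evaluates it with the standard library only; the function names and the example leaf values are illustrative assumptions, not values from the experiments.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def misalignment_probability(mu, sigma, threshold=0.75):
    """Gaussian mass between 0 and the OvR threshold, as in equation 3."""
    return normal_cdf(threshold, mu, sigma) - normal_cdf(0.0, mu, sigma)


# Hypothetical leaves: one centred on the threshold, one well above it.
p_borderline = misalignment_probability(mu=0.75, sigma=0.10)
p_confident = misalignment_probability(mu=0.95, sigma=0.05)
print(p_borderline, p_confident)
```

A segment would then be flagged as misaligned when this probability exceeds the decision threshold that is varied to trace the Precision/Recall curve.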

3.4 Artificial Neural Networks 

Using a neural network simulator developed at our lab, and the same feature
vectors used in the previous experiment, we trained a binary classifier, which
computes the probability of misalignment for each segment. As we did for
the regression tree, we had to apply a threshold to the outputs of
the neural network. Varying this threshold produced the lower dashed line
of Fig. 4.

3.5 Hidden Markov Models 

Two one-state models were created for each phonetic segment: a model for
aligned segments, and a model for the misaligned ones. Since the amount of
training data was not large enough to build context-dependent models, we
had to choose a context-independent approach. However, we took into account
the influence of the different contexts to some extent by using
four Gaussian mixtures in each state. Each model was based on the feature
vectors described in 3.1. After model training, we performed a forced alignment
between the feature vector sequences and the model pairs trained for each
phonetic segment. This experiment allowed us to find values for precision and recall
for each phonetic segment. We report the results by phone
group (Vowels, Liquids, Nasals, Plosives, Fricatives, Semi-Vowels and
Silence), which is enough to show that the precision and recall values can vary
widely with the phone type under analysis.



Table 1. Best feature pairs for the multiple phonetic segment class transitions 



Class         Precision (%)   Recall (%)
Vowels        73.2            69.8
Liquids       48.6            64.0
Nasals        82.0            67.7
Plosives      78.7            72.4
Fricatives    88.0            69.0
Semi-Vowels   44.9            67.5
Silence       97.3            87.8



Based on the previously trained models, we computed a score (HmmScore)
for each segment in order to obtain precision-recall curves, as we did for CART
and ANN. This score was calculated using equation 4.



$$HmmScore = \frac{P(x = Al \mid Model_{Al})}{P(x = Al \mid Model_{Al}) + P(x = Misal \mid Model_{Misal})} \qquad (4)$$

where $P(x = Al \mid Model_{Al})$ is the probability that segment x is aligned given the
model of aligned phones for that segment and $P(x = Misal \mid Model_{Misal})$ is the
probability that segment x is misaligned given its model of misaligned phones.
The score values are between 0 and 1. We computed the upper curve of Fig. 4 by
imposing different thresholds on the score, as we had already done for the two
other approaches. It is important to point out that in this case we are detecting
the aligned segments rather than the misaligned ones.
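A sketch of how equation 4 and the threshold sweep might be computed is given below. It assumes that the two models return log-likelihoods, as HMM toolkits commonly do, so the ratio is evaluated in a numerically stable way; the function names and the toy data are hypothetical.

```python
import math


def hmm_score(loglik_aligned, loglik_misaligned):
    """Equation 4 computed from log-likelihoods (assumed model outputs)."""
    top = max(loglik_aligned, loglik_misaligned)
    p_al = math.exp(loglik_aligned - top)
    p_mis = math.exp(loglik_misaligned - top)
    return p_al / (p_al + p_mis)


def precision_recall(scores, is_aligned, threshold):
    """Precision and recall for detecting aligned segments (score >= threshold)."""
    tp = sum(1 for s, y in zip(scores, is_aligned) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, is_aligned) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, is_aligned) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy scores and reference labels; each threshold gives one curve point.
scores = [0.9, 0.8, 0.3, 0.6]
labels = [True, False, False, True]
for threshold in (0.5, 0.7, 0.85):
    print(threshold, precision_recall(scores, labels, threshold))
```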

3.6 Results 

The results depicted in Fig. 4 suggest that the HMM approach outperforms the
others by far. The other two approaches perform very similarly: for some
applications one should choose CARTs, for others ANNs.




Fig. 4. Plot of precision and recall of the proposed confidence measures 



4 Conclusions 

In the first part of the paper, we explored the advantages of using an
HMM-based aligner together with an aligner based on speech synthesis,
showed the increase in accuracy of the combined system, and proposed a new
measure of alignment accuracy. In the second part of the paper we
proposed and evaluated three new approaches to computing confidence measures
for phonetic annotation, and found that the HMM-based approach
is by far the best one.



Abstract. We describe a proposal for spelling correction intended to be
applied to Galician, a Romance language. Our aim is to demonstrate
the flexibility of a novel technique that provides a quality equivalent
to global strategies, but at a significantly lower computational cost.
To do so, we take advantage of the grammatical background present in
the recognizer, which allows us to dynamically gather information to the
right and to the left of the point at which the recognition halts in a word,
as long as this information can be considered relevant for the repair
process. The experimental tests prove the validity of our approach in
relation to previous ones, focusing on both performance and costs.



1 Introduction 

Galician belongs to the group of Latin languages, with influences from the peoples
living in the region before the Roman colonization, as well as contributions from
other languages subsequent to the break-up of that empire. Long relegated to
informal usage, it managed to survive well into the 20th century until it
was once again granted the status of official language for Galicia, together with
Spanish. Although several dialects exist, it has recently been standardized
and, as a consequence, there is a pressing need for tools that permit
a correct linguistic treatment. A main point of interest is the development of
efficient error repair tools, in particular for spelling correction purposes.

In this context, the state of the art focuses on global techniques based on the
consideration of error thresholds to reduce the number of repair alternatives, a
technique often dependent on the recognizer. So, Oflazer [5] introduces a
cut-off distance that can be computed efficiently by maintaining a matrix [2], which
helps the system to determine when a partial repair will not yield any result
by providing non-decreasing repair paths. In order to avoid maintaining this
matrix, Savary [6] embeds the distance in the repair algorithm, although this
allows partial corrections to be reached several times with different intermediate
distances, which is not time-efficient for error threshold values greater than one.
In any case, these pruning techniques are strongly conditioned by the estimation of
the repair region, and their effectiveness is relative in global approaches.
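To make the idea of threshold-based pruning concrete, here is a minimal sketch of a global error-tolerant lookup. It is neither Oflazer's nor Savary's algorithm: it assumes a plain trie instead of a minimal FA, unit edit costs and no transposition, and the names `build_trie` and `tolerant_lookup` are hypothetical. A branch is abandoned as soon as every entry in its edit-distance row exceeds the threshold, since distances along a path can no longer decrease below that minimum.

```python
def build_trie(words):
    """Build a nested-dict trie; "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root


def tolerant_lookup(trie, word, threshold):
    """Return {candidate: distance} for lexicon words within `threshold` edits."""
    results = {}

    def visit(node, prefix, prev_row):
        if node.get("$") and prev_row[-1] <= threshold:
            results[prefix] = prev_row[-1]
        for ch, child in node.items():
            if ch == "$":
                continue
            # One more row of the Levenshtein table for prefix + ch.
            row = [prev_row[0] + 1]
            for j in range(1, len(word) + 1):
                cost = 0 if word[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,
                               prev_row[j] + 1,
                               prev_row[j - 1] + cost))
            # Prune: no extension of this prefix can get back under the threshold.
            if min(row) <= threshold:
                visit(child, prefix + ch, row)

    visit(trie, "", list(range(len(word) + 1)))
    return results


lexicon = build_trie(["azul", "azuis", "autor", "autora"])
print(tolerant_lookup(lexicon, "azol", 1))
print(tolerant_lookup(lexicon, "autor", 1))
```

Note that this search still starts from the initial state for every input word, which is exactly the global behavior that a regional strategy tries to avoid.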

In contrast to global algorithms, which expend equal effort on all parts of
the word, including those containing no errors, we introduce regional repairs
that avoid examining the entire word. This is of importance since Galician is
an inflectional language with a great variety of morphological processes, and a
non-global strategy could drastically reduce the costs. In our work, which
focuses on word processing, the descriptive model is a regular grammar (RG)
and the operational one is a finite automaton (FA). Repairs on
RGs are explored breadth-wise, whilst the number of states in the associated
FA is massive. A complex morphology therefore impacts both
time and space bounds, which can even become exponential; this justifies our
approach.

2 The Error Repair Model 

Our aim is to parse a word $w_{1..n} = w_1 \ldots w_n$ according to an RG $\mathcal{G} = (N, \Sigma, P, S)$.
We denote by $w_0$ (resp. $w_{n+1}$) the position in the string previous to $w_1$
(resp. following $w_n$). We generate from $\mathcal{G}$ a minimal acyclic FA
for the language $\mathcal{L}(\mathcal{G})$. In practice, we choose a device [4] generated
by Galena [3]. An FA is a 5-tuple $\mathcal{A} = (Q, \Sigma, \delta, q_0, Q_f)$ where $Q$ is the
set of states, $\Sigma$ the set of input symbols, $\delta$ a function of $Q \times \Sigma$ into $2^Q$
defining the transitions of the automaton, $q_0$ the initial state and $Q_f$ the set
of final states. We denote $\delta(q, a)$ by $q.a$, and we say that $\mathcal{A}$ is deterministic iff
$|q.a| \le 1, \forall q \in Q, a \in \Sigma$. The notation is transitive: $q.w_{1..n}$ denotes the state
$(\ldots(q.w_1)\ldots).w_n$. As a consequence, $w$ is accepted iff $q_0.w \in Q_f$, that is, the
language accepted by $\mathcal{A}$ is defined as $\mathcal{L}(\mathcal{A}) = \{w \mid q_0.w \in Q_f\}$. An FA is
acyclic when its underlying graph is. We define a path in the FA as a sequence
of states $\{q_1, \ldots, q_n\}$ such that $\forall i \in \{1, \ldots, n-1\}, \exists a_i \in \Sigma, q_i.a_i = q_{i+1}$. In
order to reduce the memory requirements, we apply a minimization process [1].
Two FAs are equivalent iff they recognize the same language. Two states,
$p$ and $q$, are equivalent iff the FA with $p$ as initial state and the one that
starts in $q$ recognize the same language. An FA is minimal iff no pair of states
in $Q$ is equivalent.
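The notation q.a and its transitive extension to words can be sketched with a small deterministic, acyclic automaton. The class below is illustrative only: it builds a trie-shaped automaton from a word list and does not perform the minimization step mentioned above; all names are hypothetical.

```python
class FiniteAutomaton:
    """Deterministic acyclic automaton built as a trie (not minimized)."""

    def __init__(self, initial=0):
        self.initial = initial          # q0
        self.finals = set()             # Qf
        self.delta = {}                 # (state, symbol) -> state, i.e. q.a
        self._next_state = initial + 1

    def add_word(self, word):
        q = self.initial
        for a in word:
            if (q, a) not in self.delta:
                self.delta[(q, a)] = self._next_state
                self._next_state += 1
            q = self.delta[(q, a)]
        self.finals.add(q)

    def move(self, q, word):
        """Transitive notation q.w; returns None when no transition exists."""
        for a in word:
            q = self.delta.get((q, a))
            if q is None:
                return None
        return q

    def accepts(self, word):
        """w is accepted iff q0.w is a final state."""
        return self.move(self.initial, word) in self.finals


fa = FiniteAutomaton()
for entry in ("azul", "azuis"):
    fa.add_word(entry)
print(fa.accepts("azul"), fa.accepts("azu"), fa.accepts("azol"))
```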

2.1 The Dynamic Programming Frame 

Although the standard recognition process is deterministic, the repair process may
introduce non-determinism by exploring alternatives associated with possibly more
than one recovery strategy. So, in order to get polynomial complexity, we avoid
duplicating intermediate computations in the repair of $w_{1..n} \in \Sigma^+$, storing them
in a table $\mathcal{I}$ of items, $\mathcal{I} = \{[q, i], q \in Q, i \in [1, n+1]\}$, where $[q, i]$ looks for the
suffix $w_{i..n}$ to be analyzed from $q \in Q$.

We describe our proposal using parsing schemata [7], a triple $\langle \mathcal{I}, \mathcal{H}, \mathcal{D} \rangle$, with
$\mathcal{H} = \{[a, i], a = w_i\}$ an initial set of items called hypotheses, which encodes the
word to be recognized¹, and $\mathcal{D}$ a set of deduction steps that allow items to be
derived from previous ones. These are of the form $\{\eta_1, \ldots, \eta_k \vdash \xi / conds\}$, meaning that
if all antecedents $\eta_i$ are present and the conditions $conds$ are satisfied, then the
consequent $\xi$ is generated. In our case, $\mathcal{D} = \mathcal{D}^{Init} \cup \mathcal{D}^{Shift}$, where:

$$\mathcal{D}^{Init} = \{\vdash [q_0, 1]\} \qquad \mathcal{D}^{Shift} = \{[p, i] \vdash [q, i+1] \,/\, \exists [a, i] \in \mathcal{H},\ q = p.a\}$$

The recognition associates a set of items $S_p^w$, called an itemset, to each $p \in Q$,
and applies these deduction steps until no new application is possible. The word
is recognized iff a final item $[q_f, n+1]$, $q_f \in Q_f$, has been generated. We can
assume, without loss of generality, that $Q_f = \{q_f\}$, and that there exists only one
transition from (resp. to) $q_0$ (resp. $q_f$). To achieve this, we augment the FA with two
states that become the new initial and final states, linked to the original ones
through empty transitions, our only concession to the notion of minimal FA.
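The recognition by deduction steps can be sketched as a small agenda-driven loop over items [q, i]. This is only an illustration under simplifying assumptions: the automaton is given directly as a transition dictionary, the augmentation with a unique final state is omitted, and the function name `recognize` is hypothetical.

```python
def recognize(transitions, q0, finals, word):
    """Item-based recognition with an initial step and a shift step.

    transitions: {(state, symbol): state}; an item (q, i) means that
    the suffix starting at position i is to be analyzed from state q.
    """
    n = len(word)
    hypotheses = {i: a for i, a in enumerate(word, start=1)}  # items [a, i]
    items = {(q0, 1)}          # initial step: [q0, 1]
    agenda = [(q0, 1)]
    while agenda:
        p, i = agenda.pop()
        if i > n:
            continue
        q = transitions.get((p, hypotheses[i]))
        # Shift step: [p, i] yields [q, i + 1] when q = p.a and [a, i] holds.
        if q is not None and (q, i + 1) not in items:
            items.add((q, i + 1))
            agenda.append((q, i + 1))
    recognized = any((qf, n + 1) in items for qf in finals)
    return recognized, items


# Toy automaton accepting only "azul".
delta = {(0, "a"): 1, (1, "z"): 2, (2, "u"): 3, (3, "l"): 4}
print(recognize(delta, 0, {4}, "azul"))
print(recognize(delta, 0, {4}, "azol"))
```

In the second call the process stops after the item (2, 3): no shift is possible on the third character, which is the situation in which a repair would be triggered.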

2.2 The Formalization 

Let us assume that we are dealing with the first error in a word $w_{1..n} \in \Sigma^+$. We extend
the item structure to $[p, i, e]$, where $e$ is the error counter accumulated in the
recognition of $w$ at position $w_i$ in state $p$. We talk about the point of error, $w_i$, as
the point at which the difference between what was intended and what actually
appears in the word occurs, that is, $q_0.w_{1..i-1} = q$ and $q.w_i \notin Q$. The next step
is to locate the origin of the error, limiting the impact on the analyzed prefix to
the context close to the point of error, in order to save computational effort.

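Since we work with acyclic FAs, we can introduce a simple order in $Q$ by
defining $p < q$ iff there exists a path $\rho = \{p, \ldots, q\}$; and we say that $q_s$ (resp. $q_d$)
is a source (resp. drain) for $\rho$ iff $\exists a \in \Sigma$, $q_s.a = p$ (resp. $q.a = q_d$). In this
manner, the pair $(q_s, q_d)$ defines a region $\mathcal{R}_{q_s}^{q_d}$ iff $\forall \rho$, $\mathrm{source}(\rho) = q_s$, we have
that $\mathrm{drain}(\rho) = q_d$ and $|\{\forall \rho, \mathrm{source}(\rho) = q_s\}| > 1$. So, we can talk about
$\mathrm{paths}(\mathcal{R}_{q_s}^{q_d})$ to refer to the set $\{\rho / \mathrm{source}(\rho) = q_s, \mathrm{drain}(\rho) = q_d\}$ and, given $q \in Q$,
we say that $q \in \mathcal{R}_{q_s}^{q_d}$ iff $\exists \rho \in \mathrm{paths}(\mathcal{R}_{q_s}^{q_d})$, $q \in \rho$. We also consider $\mathcal{A}$ as a global
region. So, any state, with the exception of $q_0$ and $q_f$, is included in a region.
This provides a criterion for placing around a state in the underlying graph a zone
for which any change applied to it has no effect on its context. So, we say that
$\mathcal{R}_{q_s}^{q_d}$ is the minimal region in $\mathcal{A}$ containing $p \in Q$ iff it verifies that $q_s \ge p_s$ (resp.
$q_d \le p_d$), $\forall \mathcal{R}_{p_s}^{p_d} \ni p$, and we denote it as $\mathcal{M}(p)$.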

We are now ready to characterize the point at which the recognizer detects
that there is an error and calls the repair algorithm. We say that $w_i$ is the
point of detection associated to a point of error $w_j$ iff $\exists q_d > q_0.w_{1..j}$,
$\mathcal{M}(q_0.w_{1..j}) = \mathcal{R}_{q_0.w_{1..i}}^{q_d}$, which we denote by $\mathrm{detection}(w_j) = w_i$. We then
talk about $\mathcal{R}_{q_0.w_{1..i}}^{q_d}$ as the region defining the point of detection $w_i$. The error
is located in the left recognition context, given by the closest source. However,
we also need to locate it from an operational viewpoint, as an item in the process.
We say that $[q, j] \in S_q^w$ is an error item iff $q_0.w_{j-1} = q$; and we say that
$[p, i] \in S_p^w$ is a detection item associated to $w_j$ iff $q_0.w_{i-1} = p$.



¹ A word $w_{1..n} \in \Sigma^+$, $n \ge 1$, is represented by $\{[w_1, 1], [w_2, 2], \ldots, [w_n, n]\}$.

Once we have identified the beginning of the repair region, we introduce
a modification to $w_{1..n} \in \Sigma^+$, $M(w)$, as a series of edit operations, $\{E_i\}_{i=1}^{n}$,
in which each $E_i$ is applied to $w_i$ and possibly consists of a sequence of
insertions before $w_i$, replacement or deletion of $w_i$, or transposition with $w_{i+1}$.
This topological structure can be used to restrict the notion of modification,
looking for conditions that guarantee the ability to recover from the error. So, given
$x_{1..m}$ a prefix in $\mathcal{L}(\mathcal{A})$, and $w \in \Sigma^+$ such that $xw$ is not a prefix in $\mathcal{L}(\mathcal{A})$, we
define a repair of $w$ following $x$ as $M(w)$, so that:

(1) $\mathcal{M}(q_0.x_{1..m}) = \mathcal{R}_{q_s}^{q_d}$ (the minimal region including the point of error, $x_{1..m}$)

(2) $\exists \{q_0.x_{1..i} = q_s.x_i, \ldots, q_s.x_{i..m}.M(w)\} \in \mathrm{paths}(\mathcal{R}_{q_s}^{q_d})$

denoted by $\mathrm{repair}(x, w)$, and $\mathcal{R}_{q_s}^{q_d}$ by $\mathrm{scope}(M)$. We can now organize this concept
around a point of error, $y_i \in y_{1..n}$, in order to take into account all possible repair
alternatives. So, we define the set of repairs for $y_i$ as
$\mathrm{repair}(y_i) = \{xM(w) \in \mathrm{repair}(x, w) / w_1 = \mathrm{detection}(y_i)\}$.

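We then focus on filtering out undesirable repairs, introducing criteria to select
those with minimal cost. For each $a, b \in \Sigma$ we assume insert, $I(a)$; delete,
$D(a)$; replace, $R(a, b)$; and transpose, $T(a, b)$, costs. The cost of a modification
$M(w_{1..n})$ is given by

$$\mathrm{cost}(M(w_{1..n})) = \sum_{j \in J_\dashv} I(a_j) + \sum_{i=1}^{n} \Big( \sum_{j \in J_i} I(a_j) + D(w_i) + R(w_i, b) + T(w_i, w_{i+1}) \Big)$$

where $\{a_j, j \in J_i\}$ is the set of insertions applied before $w_i$, $w_{n+1} = \dashv$ is the end
of the input and $T(w_n, \dashv) = 0$. From this, we define the set
of regional repairs for $y_i \in y_{1..n}$, a point of error, as

$$\mathrm{regional}(y_i) = \{ xM(w) \in \mathrm{repair}(y_i) \;/\; \mathrm{cost}(M) \le \mathrm{cost}(M'),\ \forall M' \in \mathrm{repair}(x, w);\ \mathrm{cost}(M) = \min_{L \in \mathrm{repair}(y_i)} \{\mathrm{cost}(L)\} \}$$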


Before dealing with cascaded errors, precipitated by previous erroneous
repairs, it is necessary to establish the relationship between recovery processes.
So, given $w_i$ and $w_j$ points of error, $j > i$, we define the set of viable repairs for $w_i$
in $w_j$ as $\mathrm{viable}(w_i, w_j) = \{xM(y) \in \mathrm{regional}(w_i) / xM(y) \ldots w_j \text{ prefix for } \mathcal{L}(\mathcal{A})\}$.
Repairs in $\mathrm{viable}(w_i, w_j)$ are the only ones capable of ensuring the recognition
in $w_{i..j}$ and, therefore, the only ones possible at the origin of cascaded errors. In this
sense, we say that a point of error $w_k$, $k > j$, is a point of error precipitated by $w_j$
iff $\forall xM(y) \in \mathrm{viable}(w_j, w_k)$, $\exists \mathcal{R}_{q_0.w_{1..i}}^{q_d}$ defining $w_i = \mathrm{detection}(w_j)$, such that
$\mathrm{scope}(M) \subset \mathcal{R}_{q_0.w_{1..i}}^{q_d}$. This implies that $w_k$ is precipitated by $w_j$ when the region
defining the point of detection for $w_k$ summarizes all viable repairs for $w_j$ in $w_k$.
That is, the information compiled from those repairs has not been sufficient to
give continuity to a process locating the new error in a region containing the
preceding ones and, as a consequence, depending on them. We then conclude
that the origin of the current error could be an incorrect treatment of past ones.



2.3 The Algorithm 

Most authors resort to global methods to avoid distortions due to unsafe error
location [5, 6]; our proposal instead applies a dynamic estimation of the repair region,
guided by the linguistic knowledge present in the underlying FA. Formally, we
extend the item structure to $[p, i, e]$, where $e$ is the error counter accumulated
in the recognition of $w$ at position $w_i$ in state $p$.

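Once the point of error has been located, we apply all possible transitions beginning
at its point of detection, which corresponds to the following deduction steps in
error mode, $\mathcal{D}_{error} = \mathcal{D}_{error}^{Shift} \cup \mathcal{D}_{error}^{Insert} \cup \mathcal{D}_{error}^{Delete} \cup \mathcal{D}_{error}^{Replace} \cup \mathcal{D}_{error}^{Transpose}$:

$$\mathcal{D}_{error}^{Shift} = \{[p, i, e] \vdash [q, i+1, e],\ \exists [a, i] \in \mathcal{H},\ q = p.a\}$$

$$\mathcal{D}_{error}^{Insert} = \{[p, i, e] \vdash [p, i+1, e + I(a)],\ \nexists\, p.a\}$$

$$\mathcal{D}_{error}^{Delete} = \{[p, i, e] \vdash [q, i-1, e + D(w_i)] \;/\; \mathcal{M}(q_0.w_{1..j}) = \mathcal{R}_{q_s}^{q_d},\ p.w_i = q \in \mathcal{R}_{q_s}^{q_d} \text{ or } q = q_d\}$$

$$\mathcal{D}_{error}^{Replace} = \{[p, i, e] \vdash [q, i+1, e + R(w_i, a)] \;/\; \mathcal{M}(q_0.w_{1..j}) = \mathcal{R}_{q_s}^{q_d},\ p.a = q \in \mathcal{R}_{q_s}^{q_d} \text{ or } q = q_d\}$$

$$\mathcal{D}_{error}^{Transpose} = \{[p, i, e] \vdash [q, i+2, e + T(w_i, w_{i+1})] \;/\; \mathcal{M}(q_0.w_{1..j}) = \mathcal{R}_{q_s}^{q_d},\ p.w_i.w_{i+1} = q \in \mathcal{R}_{q_s}^{q_d} \text{ or } q = q_d\}$$

where $w_{1..j}$ looks for the current point of error. Observe that, in any case, the
error hypotheses apply on transitions behind the repair region. The process
continues until a repair covers the repair region.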

When dealing with an error which is not the first one in the word,
it could condition a previous repair. This arises when we realize that we come
back to a detection item for which some recognition branch includes a previous
recovery process. The algorithm resumes from the error counters, adding the cost of
new error hypotheses to profit from the experience gained from previous repairs.
This permits us to deduce that if $w_l$ is a point of error precipitated by $w_k$, then:

$$q_0.w_{1..i} < q_0.w_{1..j}, \quad \mathcal{M}(q_0.w_l) = \mathcal{R}_{q_0.w_{1..i}}^{q_d}, \quad w_j = y_i, \quad xM(y) \in \mathrm{viable}(w_k, w_l)$$

which proves that the state associated to the point of detection in a cascaded
error is lower than the one associated to the source of the scope in the repairs
precipitating it. So, the minimal possible scope of a repair for the cascaded error
includes any scope of the previous ones, that is,

$$\max\{\mathrm{scope}(M),\ M \in \mathrm{viable}(w_k, w_l)\} \subset \max\{\mathrm{scope}(M),\ M \in \mathrm{regional}(w_l)\}$$

This allows us to get an asymptotic behavior close to that of global repair methods,
ensuring a comparable quality, but at the cost of a local method in practice.



3 An Overview on Galician 

Although Galician is a non-agglutinative language, it shows a great variety of
morphological processes. The most outstanding features are found in verbs,
with a highly complex conjugation paradigm, including ten simple tenses. If
we add the present imperative with two forms, the non-conjugated infinitive, the
gerund and the participle, then 65 inflected forms are possible for each verb. In
addition, irregularities are present in both stems and endings. So, very common
verbs, such as facer, have up to five different stems: fac-er, fag-o, fa-s,
fac-emos, fix-en. Approximately 30% of Galician verbs are irregular. We have
implemented 42 groups of irregular verbs. Verbs also include enclitic pronouns
producing changes in the stem due to the presence of accents: deu, deullelo.

In Galician the unstressed pronouns are usually suffixed and, moreover,
pronouns can easily be joined together and can also be contracted (lle
+ o = llo), as in the case of vaitemello buscar. It is also very common to use
what we call a solidarity pronoun,
in order to let the listeners participate in the action. Therefore, we have
even implemented forms with four enclitic pronouns, like perdeuchellevolo.
Here, the pronouns che and vos are solidarity pronouns,
used to involve the interlocutor in the facts that are being told.
Neither of them has a translation into English, because this language lacks this
kind of pronoun. So, the analysis has to segment the word and return five
tokens.

There exist highly irregular verbs that cannot be classified in any irregular
model, such as ir or ser; other verbs include gaps in
which some forms are missing or simply not used. For instance, meteorological
verbs such as chover are conjugated only in the third person singular.
Finally, verbs can present duplicate past participles, like impreso and imprimido.

This complexity extends to gender inflection, with words with only one
gender, such as home and muller, and words with the same form
for both genders, such as azul. We also have a lot of models for words with
separate masculine and feminine forms: autor, autora; xefe, xefa;
poeta, poetisa; rei, raíña or actor, actriz. We
have implemented 33 variation groups for gender.

We can also refer to number inflection, with words only being presented
in singular form, such as luns, and others where only the plural
form is correct, such as matemáticas. The construction of different
forms does not involve as many variants as is the case for gender, but we
can also consider a certain number of models: roxo, roxos; luz, luces;
animal, animais; inglés, ingleses; azul, azuis
or funil, funís. We have implemented 13 variation groups for
number.

4 The System at Work 

Our aim is to validate our proposal by comparing it with global ones, an objective
criterion for measuring the quality of a repair algorithm, since the point of reference
is a technique that guarantees the best quality for a given error metric when all
contextual information is available. To illustrate this aspect, we choose to work
with a lexicon for Galician built from GALENA [3], which includes 304,331 different
words. The lexicon is recognized by an FA containing 16,837 states connected by
43,446 transitions, whose size we consider sufficient for our purposes.

4.1 The Operational Testing Frame

From this lexicon, we select a representative sample of morphological errors for
its practical evaluation. This can be easily verified from Fig. 1, which shows the
similar distribution of both the original lexicon and the running sample, in terms
of the lengths of the words to deal with. In each length category, errors have been
randomly generated in a number and at positions in the input string that are shown
in Fig. 2. This is of importance since, as their authors claim, the performance of
previous proposals depends on these factors, which makes no practical sense. No
other dependencies have been detected at the morphological level and, therefore,
they have not been considered.





Fig. 1. Statistics on the general and error lexicons 



In this context, our testing framework seems to be well balanced, from
both the operational and the linguistic viewpoints. It remains to decide which repair
algorithms will be tested. We compare our proposal with Savary's global
approach [6], an evolution of Oflazer's algorithm [5] and, to the best of our
knowledge, the most efficient method of error-tolerant look-up in finite-state
dictionaries. The comparison has been made from three viewpoints: the size of
the repair region considered, the computational cost and the repair quality.

4.2 The Error Repair Region 

We focus on the evolution of this region in relation to the location of the point
of error, in contrast to the static strategies associated with global repair approaches.
To illustrate it, we take as a running example the FA represented in Fig. 3, which
recognizes the following words in Galician:  (sausage),
 (a person who cohabits with another one),  (coherent) and
 (you cooperated). We consider as input string the erroneous one  ,
resulting from transposing  with  in  (sausage), and replacing the
character  by  . We shall describe the behavior from both viewpoints,
Savary's algorithm [6] and our proposal, proving that in the worst case, when
precipitated errors are present, our proposal can resume the repair process to
recover the system from cascaded errors.

In this context, the recognition comes to a halt on state $q_9$, for which
$\mathcal{M}(q_9)$ is the associated minimal region and no transition is possible on  .
So, our approach locates the error at $q_9$ and applies from it the error
hypotheses, looking for the minimal edit distance in a repair that allows the
state $q_{22}$ to be reached. In this case, there are two possible regional repairs,
consisting of first replacing  by  and later inserting an  after  (resp.
replacing  by  ), to obtain the modification on
the entire input string  (resp.  ), which is not a word in our
running language.




Fig. 2. Number of items generated in error mode 



As a consequence, although we return to standard recognition in $q_{22}$,
the next input character is now  (resp.  ), for which no transition is
possible, and we come back to error mode on the region $\mathcal{M}(q_{22})$, which
includes $\mathcal{M}(q_9)$. We then interpret that the current error is precipitated by the
previous one, possibly as a cascaded error. As a result, none of the regional repairs
generated allows us to resume standard recognition beyond the state $q_{24}$. At
this point, $\mathcal{M}(q_{24})$ becomes the new region, and the only regional repair
is now defined as the transposition of the  with  , and the substitution of
 by  ; this agrees with the global repair proposed by Savary, although
the repair region is not the total one as is the case for that algorithm. This repair
finally allows acceptance by the FA.

The repair process described is interesting for two reasons. First, it shows
that we do not need to extend the repair region to the entire FA in
order to get the least-cost correction and, second, that the risk of cascaded errors
can be efficiently handled in the context of non-global approaches. Finally, in the
worst case, our running example clearly illustrates the convergence of our regional
strategy towards the global one from both viewpoints, the computational cost
and the quality of the correction.

4.3 The Computational Cost 

These practical results are compiled in Fig. 2, using the concept of item
previously defined as the unit for measuring computational effort. We here consider
two complementary approaches illustrating the dependence on both the position
of the first point of error in the word and the length of the suffix from it. So, in
any case, we are sure to take into account the degree of penetration in the FA at
that point, which determines the effectiveness of the repair strategy. In effect,
when working with regional methods, the penetration determines the number of
regions in the FA including the point of error and, as a consequence, the
possibility of considering a non-global resolution.




Fig. 3. The concept of region in error repair 



In order to clearly show the detail of the tests on errors located at the end
of the word, which is not easy to observe on the decimal scale of Fig. 2, we
include in Fig. 4 the same results using a logarithmic scale. Both graphics
illustrate our contribution, in terms of computational effort saved, from
two viewpoints which are of interest in real systems. First, our proposal shows
in practice a linear-like behavior, in contrast to Savary's, which seems
to be of exponential type. In particular, this translates into an essential property
in industrial applications: the independence of the response time from the
initial conditions for the repair process. Second, in any case, the number of
computations is significantly reduced when we apply our regional criterion.

4.4 The Performance

However, statistics on computational cost only provide a partial view of the 
repair process, which must also take into account data related to the 
performance from both the user’s and the system’s viewpoint. To this end, we 
have introduced the following two measures, for a given word, w, containing an 
error: 

\[
\mathit{performance}(w) = \frac{\text{useful items}}{\text{total items}}
\qquad
\mathit{recall}(w) = \frac{\text{proposed corrections}}{\text{total corrections}}
\]

which we complement with a global measure of the precision of the error repair 
approach in each case, that is, the rate at which the algorithm provides the 
correction expected by the user. We use the term useful items to refer to the 
number of generated items that finally contribute to obtaining a repair, and 
total items to refer to the number of these structures generated during the 
process. We denote by proposed corrections the number of corrections provided 
by the algorithm, and by total corrections the number of possible ones, 
absolutely. 
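
As a minimal sketch of how the two ratios above could be computed from raw counts, consider the following. The container and field names are illustrative assumptions, and the numbers in the usage lines are invented; none of this is taken from the original system.

```python
from dataclasses import dataclass


@dataclass
class RepairCounts:
    """Counts collected while repairing one erroneous word w (hypothetical)."""
    useful_items: int          # items that finally contributed to a repair
    total_items: int           # all items generated during the repair
    proposed_corrections: int  # corrections returned by the algorithm
    total_corrections: int     # all possible corrections


def performance(c: RepairCounts) -> float:
    """Share of generated items that were actually useful."""
    return c.useful_items / c.total_items


def recall(c: RepairCounts) -> float:
    """Share of the possible corrections that the algorithm proposed."""
    return c.proposed_corrections / c.total_corrections


# Invented counts, for illustration only.
counts = RepairCounts(useful_items=30, total_items=120,
                      proposed_corrections=3, total_corrections=4)
print(performance(counts), recall(counts))
```

A global method always proposes every possible correction, so its recall under this definition stays at 1 for a fixed error counter, which matches the constant graph discussed for Savary's algorithm.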





Fig. 4. Number of items generated in error mode. Logarithmic scale (panels: length of the current suffix; position of the first point of error) 



These results are shown in Fig. 5, illustrating some interesting aspects of the 
asymptotic behavior that we want to highlight in the regional approach. 
Considering the running example, the performance in our case is not only better 
than Savary’s, but the difference between them increases with the location of 
the first point of error. Intuitively, this is because the closer this point is 
to the beginning of the word, the greater the number of useless items generated 
in error mode, a simple consequence of the higher availability of different 
repair paths in the FA when we are working in a region close to q0. In effect, 
given that the concept of region is associated with the definition of 
corresponding source and drain points, regions of this kind are often 
equivalent to the total one, since the disposition of these regions is always 
concentric. At this point, regional and global approaches apply the same error 
hypotheses not only on the same region, but also from close states given that, 
in any case, one of the starting points for these hypotheses would be q0 or a 
state close to it. That is, in the worst case, both algorithms converge. 

The same reasoning could be applied to points of error associated with a state 
in the recognition that is close to qf, in order to estimate the repair region. 
However, in this case, the number of items generated is greater for the global 
technique, because the morphology of the language often results in the 
generation of regions concentrated near qf, a simple consequence of the common 
derivative mechanisms applied to suffixes defining gender, number or verbal 
conjugation groups. So, it is possible to find a regional repair involving only 
some error hypotheses from the state associated with the point of error or from 
the associated detection point and, although this regional repair could be 
different from the global one, its computational cost would usually be lower. 

Fig. 5. Performance and recall results 



A similar behavior can be observed with respect to recall. Here, Savary’s 
algorithm shows a constant graph since the approach applied is global and, as a 
consequence, the set of corrections provided is always the entire one for a 
fixed error counter. In our proposal, the results show that the recall is 
smaller than for Savary’s, which reflects the gain in computational efficiency 
relative to the global method. Regarding the convergence between regional and 
global approaches, we must again look at points of detection close to the 
beginning of the word, which often also implies repair regions equivalent to 
the total one and repairs starting around q0, as illustrated in Fig. 5. 

However, in contrast to the case of performance, we remark that for recall the 
convergence between global and regional proposals also seems to extend to 
processes where the point of error is associated with states close to qf, that 
is, when this point is located near the end of the word. To understand this, it 
is sufficient to take into account that we are now computing not the number of 
items generated in the repair, but the number of corrections finally proposed. 
Given that the closer we are to the end of the word, the smaller the number of 
alternatives for a repair process, both global and regional approaches also 
converge towards the right of the graph for recall. 

Finally, the regional (resp. the global) approach provided as correction the 
word into which the error had been randomly introduced in 77% (resp. 81%) of 
the cases. Although this could be interpreted as a justification for using 
global methods, it is necessary to remember that we are now only taking into 
account morphological information, which has an impact on the precision of a 
regional approach, but not on that of a global one, which always provides all 
the repair alternatives without exclusion. So, in an exclusively morphological 
context, the precision represents a disadvantage for our proposal, since we 
base the efficiency on the limitation of the search space. The future 
integration of linguistic information from both syntactic and semantic 
viewpoints should significantly reduce this gap in precision, of less than 4%, 
or even eliminate it. 

5 Conclusion 

The design of computational tools for linguistic usage should respond to the 
constraints of efficiency, safety and maintenance. A major point of interest 
in dealing with these aspects is the development of error correction 
strategies, since techniques of this kind supply the robustness necessary to 
extend formal prototypes to practical applications. In this paper, we have 
described a proposal on spelling correction for Galician, a Latin language with 
non-trivial morphology that is trying to recover its recognition in society, 
which requires tools to ensure its correct usage. We take advantage of the 
grammatical structure present in the underlying morphological recognizer to 
provide the user with an automatic assistant for carrying out linguistic tasks 
without errors. In this sense, our work represents an initial approach to the 
problem, but preliminary results seem to be promising and the formalism well 
adapted to deal with more complex 
problems such as the consideration of additional linguistic knowledge. 

Abstract. In many languages, abbreviations are widely used in both writing and 
speech. However, abbreviations are likely to be ambiguous. Therefore, there is 
a need for disambiguation; that is, abbreviations should be expanded correctly. 
Disambiguation of abbreviations is critical for correct understanding not only 
of the abbreviations themselves but also of the whole text. Little research has 
been done concerning disambiguation of abbreviations for documents in 
English and Latin. Nothing has been done for the Hebrew language. In this 
ongoing work, we investigate a basic model, which expands abbreviations 
contained in Jewish Law Documents written in Hebrew. This model has been 
implemented in a prototype system. Currently, experimental results show that 
abbreviations are expanded correctly at a rate of almost 60%. 



1 Introduction 

Abbreviations have long been adopted by languages and are widely used in both 
writing and speech. However, they are not always defined and in many cases they 
are ambiguous. Many authors create their own sets of abbreviations. Some 
abbreviations are specific to certain domains of a language, such as science, 
the press (newspapers, television, etc.) or slang. Correct disambiguation of 
abbreviations may affect the proper understanding of the whole text. In 
general, for any given abbreviation this process is composed of the following 
two main steps: (1) finding all possible extensions and (2) selecting the most 
correct extension. 

Research concerning disambiguation of abbreviations for documents in English 
and Latin has yet to mature and has only been investigated in limited domains 
(Section 2.1). Research on this subject is completely absent for the Hebrew 
language. In Hebrew, the task of disambiguation of abbreviations is critical 
for the following reasons: 

Hebrew is very rich in its vocabulary of abbreviations. The number of Hebrew 
abbreviations is about 17,000, which is relatively high compared with the 
40,000 lexical entries in the Hebrew language. Various kinds of Hebrew texts 
contain a high frequency of abbreviations, for example, Jewish texts, 
scientific texts, and texts in professional domains such as computer science, 
economics, medicine and the military. 

People who learn Hebrew in general, and immigrants, children and people who 
need to read and learn documents related to a new domain (e.g., a specific 
professional domain) in particular, need a lot of help. Often they do not know 
the possible extensions, or know only some of them, and they cannot find the 
most correct extension. Therefore, they experience great difficulty 
understanding the meaning of the running text. 

In this paper, we present the current state of our ongoing work. We develop six 
different basic methods, and combinations of them, for abbreviation 
disambiguation in Hebrew. The first three methods focus on unique 
characteristics of each abbreviation, e.g., common words, prefixes and 
suffixes. The next two methods are statistical methods based on general grammar 
and literary knowledge of Hebrew. The final method uses the numerical value 
(Gematria) of the abbreviation. 

This paper is organized as follows. Section 2 gives background concerning 
disambiguation of abbreviations, previous systems dealing with automatic 
disambiguation of abbreviations and the Hebrew language. Section 3 describes the 
proposed baseline methods for disambiguation of Hebrew abbreviations. Section 4 
presents the experiments we have carried out. Section 5 discusses hard cases of 
abbreviations that our basic methods are not able to disambiguate, gives an example 
for such a hard case and proposes future methods for solving them. Section 6 
concludes and proposes future directions for research. The Appendix contains the 
Hebrew Transliteration Table. 



2 Background 

An abbreviation is a letter or group of letters that is a shortened form of a 
word or a sequence of words. The word or sequence of words is called the long 
form of the abbreviation. Abbreviation disambiguation means choosing the 
correct long form depending on the context. 

Abbreviations are very common and are widely used in both writing and speech. 
However, they are not always defined and in many cases they are ambiguous. 
Disambiguation of abbreviations is critical for correct understanding not only 
of the abbreviations themselves but of the whole text. 

An initialism is an abbreviation formed by using the first letters, or 
initials, of a series of words, e.g., "AA" or "ABC". Some make a distinction 
between regular initialisms and acronyms. An acronym exists when the letters 
form a pronounceable word, like “ASCII”, while initialisms are pronounced by 
sounding out each letter, like “CDC” ("see dee see"). In this article we make 
no distinction between these concepts and refer to them both as initialisms. 

In English, there are several forms for abbreviations and initialisms. For example: 

• Uppercase-Uppercase (e.g., AA is an abbreviation for the following extensions: 
Alcoholics Anonymous, American Airlines, Automobile Association, Anti-aircraft, 
Argentinum Astrum, Alcoa (stock symbol)) 

• Uppercase-lowercase (e.g., Au means Australia or Austria or Gold) 

• lowercase-lowercase (e.g., au, which means atomic units) 

• Usage of periods and space (e.g., DC., D.C, D.C., D. C.) 
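
The last item in the list above suggests a simple preprocessing step: punctuation and spacing variants of the same initialism can be mapped to one key before lookup, while letter case is kept because it distinguishes different abbreviations (Au versus au). The following is a minimal sketch under that assumption; the function name is hypothetical and this step is not described in the paper.

```python
import re


def normalize_initialism(token: str) -> str:
    """Drop periods and whitespace; keep letter case untouched."""
    return re.sub(r"[.\s]", "", token)


variants = ["DC.", "D.C", "D.C.", "D. C."]
print({normalize_initialism(v) for v in variants})
```

All four variants collapse to the same key, whereas "Au" and "au" remain distinct.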

Additional examples of ambiguous initialisms and their extensions can be 
obtained from [15]. 

2.1 Previous Automatic Disambiguation of Abbreviations 

Little research has been done in the domain of abbreviation disambiguation. An 
abbreviation dictionary for biomedicine, called S-RAD, has been created 
automatically by Adar [1]. Statistical studies concerning abbreviations in 
general and three-letter abbreviations in particular in medical texts are 
presented in [10] and [9], respectively. 

An automatic system that disambiguates abbreviations in medical abstracts was 
developed by Yu et al. [16]. Their system uses a supervised machine-learning 
technique called Support Vector Machines (SVM), proposed by Cortes and Vapnik 
[4]. This method jointly optimizes the error rate on the training data set and 
the predictive ability of the model, where the latter depends on the concept of 
VC-dimension. The SVM is applied in various research domains such as word sense 
disambiguation [3] and text classification [8]. In addition, their method uses 
the One Sense per Discourse Hypothesis, which was introduced by Gale et al. 
[5]. They reported that 94% of the occurrences of ambiguous words from the same 
discourse have the same meaning. Analogously, taking the sense of an 
abbreviation to be its long form, we can assume that when an abbreviation 
occurs several times within a medical abstract, all of the occurrences have the 
same long form. Experiments show that their method achieves an accuracy of 
about 84% in selecting the most correct long form for a given abbreviation. 
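
The One Sense per Discourse assumption can be illustrated with a small post-processing sketch: within one document, every occurrence of an abbreviation is relabeled with the long form predicted most often for it. This is only an illustration of the hypothesis, not the implementation of Yu et al.; the function name and the input format are assumptions.

```python
from collections import Counter


def apply_one_sense_per_discourse(predictions):
    """predictions: list of (abbreviation, predicted_long_form) pairs,
    one per occurrence, all taken from the same document."""
    votes = {}
    for abbreviation, long_form in predictions:
        votes.setdefault(abbreviation, Counter())[long_form] += 1
    # Majority long form per abbreviation within this document.
    majority = {a: c.most_common(1)[0][0] for a, c in votes.items()}
    return [(a, majority[a]) for a, _ in predictions]


doc = [("AA", "Alcoholics Anonymous"),
       ("AA", "American Airlines"),
       ("AA", "Alcoholics Anonymous")]
print(apply_one_sense_per_discourse(doc))
```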

Another working system has been constructed by Rydberg-Cox [11]. This system 
disambiguates abbreviations in early modern texts written in Latin. The system 
uses a three-step algorithm for a given abbreviation. The first two steps 
identify all possible expansions for this abbreviation. In the first step, a 
Latin morphological analyzer is used to parse every possible expansion. In the 
second step, a database of known inflected forms of Latin literature is 
searched for other possible expansions. In the final step, three relatively 
simple metrics are used to select the best possible expansion: (1) expansions 
discovered using the Latin morphological analyzer are preferred to those 
discovered using the search in the database of known inflected forms, (2) 
frequency tables for expanded forms in the corpus are consulted, and (3) a 
database of collocations is searched to determine whether the current context 
is similar to other contexts where possible expansions of the abbreviation 
occur in the corpus. In preliminary tests on a small corpus, the author reports 
accurate expansions for most abbreviations. Future plans include developing 
better statistical metrics for the third step and testing the algorithm on a 
larger corpus. 

However, nothing has been done concerning automatic abbreviation 
disambiguation for Hebrew. Abbreviations in Hebrew are widely used and many of 
them are ambiguous. In the next sub-section, we will discuss various aspects of 
Hebrew and abbreviations in Hebrew. 

2.2 The Hebrew Language 

Hebrew is a Semitic language. It uses the Hebrew alphabet and is written from 
right to left. Hebrew words in general, and Hebrew verbs in particular, are 
based on three (sometimes four) basic letters, which create the word's stem. 
The stem of a Hebrew verb is called pl¹ (פועל, “verb”)². The first letter of the 
stem, p (פ), is called pe hapoal; the second letter of the stem (ע) is called 
ayin hapoal; and the third letter of the stem, l (ל), is called lamed hapoal. 
The names of the letters are especially important for the verbs' declensions 
according to the suitable verb types. 

Except for the word’s stem, there are several other components, which create the 
word’s declensions: 

Conjugations: The Hebrew language contains seven conjugations that include the 
verb’s stem. The conjugations add different meanings to the stem, such as 
active, passive, causative, etc. For example, the stem hrs (הרס, “destroy”) 
means “destroy” in one conjugation, hrs, but “being destroyed” in another 
conjugation, nhrs (נהרס). 

Verb types: The Hebrew language contains several verb types. Each verb type is 
a group of stems whose verbs behave in the same way in different tenses and 
different conjugations. The declensions of the stem differ between verb types. 
In English, in order to change the tense, one only needs to add one or two 
letters as a suffix. In Hebrew, however, for each verb type the word changes in 
a different way with the tense. 

To demonstrate, we choose two verbs in the past tense from different verb 
types: (1) ktv (כתב, “wrote”) of the shlemim verb type (strong verbs - all three 
letters of the stem are apparent), and (2) nfl (נפל, “fell”) of the hasrey pay 
noon verb type (where the first letter of the stem is the letter n and in 
several declensions of the stem this letter is omitted). In the future tense, 
the word ktv (כתב, “wrote”) changes to ykhtv (יכתב, “will write”), while the 
second word, nfl, changes to ypl (יפל, “will fall”), which does not include the 
letter n. Therefore, in order to find the right declensions for a certain stem, 
it is necessary to know which verb type the stem comes from. 

Subject: Usually, in English we add the subject as a separate word before the 
verb, for example, I ate, you ate, where the verb changes minimally, if at all. 
However, in Hebrew the subject does not have to be a separate word and it can 
appear as a suffix. 

Prepositions: Unlike English, which has unique words dedicated to expressing 
relations between objects (e.g., at, in, from), Hebrew has 8 prepositions that 
can be written as a letter concatenated at the beginning of the word. Each 
letter expresses a different relation. For example: (1) the meaning of the 
letter v (ו) at the beginning of a word is identical to the meaning of the word 
“and” in English; for example, the Hebrew word v’t’ (ואתה) means “and you”; (2) 
the meaning of the letter l (ל) at the beginning of a word is similar to the 
English word “to”; for instance, the Hebrew word lysr’l (לישראל) means “to 
Israel”. 

Belonging: In English, there are some unique words that indicate belonging 
(e.g., my, his, her). This phenomenon also exists in Hebrew. In addition, there 
are several suffixes that can be concatenated at the end of the word for that 
purpose. The meaning of the letter y (י) at the end of a word is identical to 
the meaning of the word “my” in English. For example, the Hebrew word ty (עטי) 
has the same meaning as the English words “my pen”. 

Object: In English, there are some unique words that indicate the object in the 
sentence, such as him, her, and them. This is also the case in Hebrew. In 
addition, there are several letters that can be concatenated at the end of the 
word for that purpose. The letter v (ו) at the end of a word has the same 
meaning as the word “him” in English. For example, the Hebrew word r’ytyv 
(ראיתיו) has the same meaning as the English words “I saw him”. 

¹ See Appendix for the Hebrew Transliteration Table. 

² Each Hebrew word is presented in three forms: (1) transliteration of the Hebrew letters 
written in italics, (2) the Hebrew letters, and (3) its translation into English in quotes. 

Terminal letters: In Hebrew, there are five letters, m (מ), n (נ), ts (צ), p (פ) 
and kh (כ), which are written differently when they appear at the end of a 
word: m (ם), n (ן), ts (ץ), p (ף) and kh (ך), respectively. For example, 
consider the verb ysn (ישן, “he slept”) and the verb ysnty (ישנתי, “I slept”). 
The two verbs have the same stem ysn, but the last letter of the stem is 
written differently in each of the verbs. 
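
One practical consequence of terminal letters is that two forms of the same stem do not match as plain strings. A minimal sketch of a normalization step that maps each final form to its standard form before comparison is given below; the function name is hypothetical and the step itself is an assumption, not something the paper describes.

```python
# The five final-form letters and their standard counterparts.
FINAL_TO_STANDARD = {"ך": "כ", "ם": "מ", "ן": "נ", "ף": "פ", "ץ": "צ"}


def to_standard_forms(word: str) -> str:
    """Replace every final-form letter with its standard form."""
    return "".join(FINAL_TO_STANDARD.get(ch, ch) for ch in word)


stem = to_standard_forms("ישן")           # ysn, "he slept"
inflected = to_standard_forms("ישנתי")    # ysnty, "I slept"
print(inflected.startswith(stem))
```

After normalization the stem of "he slept" is a prefix of "I slept", which is not the case for the raw strings.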

In Hebrew, it is impossible to find the declensions of a certain stem without 
an exact morphological analysis based on the features mentioned above. The 
English language is richer in its vocabulary than Hebrew (the English language 
has about 40,000 stems while Hebrew has only about 4,000, and the number of 
lexical entries in the English dictionary is 150,000 compared with only 40,000 
in the Hebrew dictionary), but the Hebrew language is richer in its 
morphological forms. For example, the single Hebrew word vlkhsykhuhu 
(ולכשיכוהו) is translated into the following sequence of six English words: 
“and when they will hit him”. In comparison to the Hebrew verb, which undergoes 
a few changes, the English verb stays the same. 

In Hebrew, there are up to seven thousand declensions for only one stem, while 
in English there are only a few declensions. For example, the English word eat 
has only four declensions (eats, eating, eaten and ate). The relevant Hebrew 
stem ‘khl (אכל, “eat”) has thousands of declensions. Ten of them are presented 
below: (1) ‘khlty (אכלתי, “I ate”), (2) ‘khlt (אכלת, “you ate”), (3) ‘khlnv 
(אכלנו, “we ate”), (4) ‘khvl (אוכל, “he eats”), (5) ‘khvlym (אוכלים, “they 
eat”), (6) ‘tkhl (תאכל, “she will eat”), (7) l‘khvl (לאכול, “to eat”), (8) 
‘khltyv (אכלתיו, “I ate it”), (9) v‘khlty (ואכלתי, “and I ate”) and (10) 
ks‘khlt (כשאכלת, “when you ate”). 

For more detailed discussions of Hebrew grammar from the viewpoint of computer 
science, refer to [13]. For Hebrew grammar, refer either to [6, 12] in English or to 
[15] in Hebrew. 

Gematria (Numerology). Numerology is an arcane study of the purported mystical 
relationship between numbers and the character or action of physical objects 
and living things. Numerology and numerological divination were popular among 
early mathematicians such as Pythagoras. 

These methods derived from the basic need to keep count. This system was 
widely used in Mesopotamia, where numerical values were assigned to the 
characters in the syllabary and the numerical values of names were computed. 
The Greeks adopted this method and called it isopsephia. 

In this system, the first letter of the alphabet is used as the numeral One, 
the second letter as Two and so on, until the ninth letter is assigned to Nine. 
Then the Tens are assigned: the 10th letter to Ten, the 11th letter to 20, and 
so on, up to the 18th letter, which is assigned to 90. Then come the hundreds: 
the 19th letter is used as a symbol for 100, the 20th letter for 200 and so on, 
up to the 27th letter and the number 900. This system has since been in wide 
use in several common languages: Hebrew, Greek, Arabic and Chinese. The Greek 
alphabet has only 24 letters, so it uses three ancient characters, Digamma (or 
Fau), Qoppa and Tsampi, which had dropped out of use as letters, as the 
numerals 6, 90 and 900. The Arabic alphabet has 28 characters, one of which 
(Ghain) is used as a symbol for 1000. 

The system of Hebrew numerals is called Gematria. Gematria is the calculation 
of the numerical equivalence of letters, words or phrases, thus gaining insight 
into the interrelation of different concepts and exploring the 
interrelationship between words and ideas. There are more than ten ways to 
calculate the equivalence of individual letters in Gematria. We used one of the 
most common methods: Absolute Value - also known as Normative Value. Each 
letter is given the value of its accepted numerical equivalent: א (alef, the 
first letter) equals 1, ב (beit, the second letter) equals 2, and so on. The 
tenth letter, י (yud), is numerically equivalent to 10, and successive letters 
equal 20, 30, 40, and so on. The letter ק (kuf), near the end of the alphabet, 
equals 100; and the last letter, ת (tav), equals 400. 

In this reckoning, the letters ך (final chaf), ם (final mem), ן (final nun), ף 
(final pei), and ץ (final tzadik), which are the "final forms" of the letters כ 
(chaf), מ (mem), נ (nun), פ (pei), and צ (tzadik), used when these letters 
conclude a word, are generally given the same numerical equivalent as the 
standard form of the letter. 

This system is used nowadays mainly for specifying the days and years of the 
Hebrew calendar as well as chapter and page numbers. 
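
The Absolute Value reckoning described above can be sketched in a few lines. The value of the letter at position n (counting from 1) follows the units, tens, hundreds rule, and final forms take the value of their standard form. The function names are illustrative assumptions; non-letter characters such as the quotation mark inside an abbreviation are simply ignored.

```python
ALPHABET = "אבגדהוזחטיכלמנסעפצקרשת"  # the 22 Hebrew letters, in order
FINAL_TO_STANDARD = {"ך": "כ", "ם": "מ", "ן": "נ", "ף": "פ", "ץ": "צ"}


def letter_value(position: int) -> int:
    """Value of the letter at a 1-based position: 1..9, 10..90, 100..."""
    return ((position - 1) % 9 + 1) * 10 ** ((position - 1) // 9)


VALUES = {ch: letter_value(i) for i, ch in enumerate(ALPHABET, start=1)}


def gematria(text: str) -> int:
    """Sum of letter values; characters that are not letters count as 0."""
    total = 0
    for ch in text:
        ch = FINAL_TO_STANDARD.get(ch, ch)  # final forms keep the same value
        total += VALUES.get(ch, 0)
    return total


print(gematria('כ"ז'), gematria('י"א'))
```

The two abbreviations in the usage line are the ones the text reads as the numbers 27 and 11.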

2.3 Abbreviations in Hebrew 

In Hebrew, there are about 17,000 common abbreviations, not including unique 
professional abbreviations [2]. About 35% of them are ambiguous; that is, about 
6,000 abbreviations have more than one possible extension. Every Hebrew 
abbreviation contains quotation marks between its last two letters. 
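
The orthographic convention just stated gives a simple way to spot abbreviation candidates in running text. The sketch below assumes that the mark is either the ASCII double quote or the Hebrew gershayim character (U+05F4) and that tokens have already been split on whitespace; the pattern and function name are illustrative, not taken from the prototype.

```python
import re

# One or more Hebrew letters, a quotation mark, then exactly one final letter.
ABBREVIATION = re.compile(r'^[\u05D0-\u05EA]+["\u05F4][\u05D0-\u05EA]$')


def is_abbreviation(token: str) -> bool:
    return ABBREVIATION.match(token) is not None


for token in ['א"א', 'כ"ז', 'שלום']:
    print(token, is_abbreviation(token))
```

Prefix letters attached to an abbreviation still match this pattern, so separating prefixes from the abbreviation itself would need a further step.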

An example of an extremely ambiguous abbreviation is (א"א, “AA”). This 
abbreviation has 110 possible long forms [2]. Ten of them are presented below: 
(1) (אמר אברהם, “Abraham said”), (2) (אי אפשר, “not possible”), (3) (אשת איש, 
“married woman”), (4) (אבות אבותינו, “our forefathers”), (5) (אבי אבי, “my 
father’s father”), (6) (אם אבי, “my father’s mother”), (7) (אין אוכלים, “eating 
disallowed”), (8) (אין אומרים, “saying disallowed”), (9) (אם אומרים, “if said”) 
and (10) (אתה אומר, “you say”). 



3 Baseline Methods for Abbreviation Disambiguation in Hebrew 

In this section, we introduce the baseline methods we use for abbreviation 
disambiguation in Hebrew. Our methods only disambiguate abbreviations with 
sequences of words as their long forms. Our methods are classified into two 
kinds. The first set of methods is based on research of our corpus. This 
research focused on unique characteristics of each abbreviation and unique 
characteristics of the Hebrew language, for instance: prefixes, suffixes, 
common words which appear in the vicinity of the abbreviation, and the 
numerical value of the abbreviation. 

The second set of methods is composed of statistical methods based on general 
grammar and literary knowledge of Hebrew, e.g.: context related words that may 
appear in the vicinity of the abbreviation and probability of disambiguation solution 
based on the field of discussion in the context. Currently, we have not addressed 
semantic methods dealing with the meaning of abbreviations, words and sentences. 

So far, the above sets of methods have produced six baseline methods. The first 
four methods belong to the first set and the last two methods belong to the 
second set. These methods are defined as follows: 

1) Word Before (WB): this method tries to determine the proper long form based 
on the word that appears immediately before the abbreviation in the sentence. 
The credibility of this method is based on the assumption that certain words 
may appear before the abbreviation a considerable number of times, owing to 
literal and syntactic relationships in Hebrew. Many words may be uniquely 
related to a single abbreviation. For example, in the following Hebrew 
sentence: 

"17’ hv nriTO ’"K ’"V rih’nnD*? nnmiin niDKhlim" - the abbreviation ‘y (א"י) may be 
interpreted as (אינו יהודי, “non Jew”) or (ארץ ישראל, “land of Israel”). Based on 
the word before, y (ע"י, “performed by”), we can understand that the sentence 
refers to a person performing an action. Therefore, the correct solution would 
be “non Jew”. 

2) Word After (WA): this method tries to determine the proper disambiguation 
solution based on the word that appears immediately after the abbreviation in 
the sentence. The credibility of this method is based on the assumption that 
certain words may appear after the abbreviation a considerable number of times, 
owing to literal and syntactic relationships in Hebrew. Many words may be 
uniquely related to a single abbreviation. For example, in the following Hebrew 
sentence: 

riKWh t"D mm" "...a’l’V ’nan hw p’nn - the abbreviation kz (כ"ז) may be 
interpreted as (כל זה, “all this”) or 27 (the numerical value of the Hebrew 
characters כ and ז). Based on the word after, myyry (מיירי, “is discussed”), we 
can understand that the sentence is trying to explain to us which cases are 
discussed. Therefore, the correct disambiguation would be “all this”. 

3) Prefixes (PR): this method tries to determine the proper disambiguation 
solution based on the prefixes added to the abbreviation (Section 2.2). The 
credibility of this method is based on knowledge of syntax rules in Hebrew, 
which in certain cases require specific prefixes to be added. Many prefixes may 
be uniquely related to a single abbreviation. For example, in the following 
Hebrew sentence: 

h'hai D’7D1K 1£”1" "K"’0a - the abbreviation y’ (י"א) may be interpreted as (יש 
אומרים, “some say”) or 11 (the numerical value of the Hebrew characters י and 
א). The prefix letter s (ס) adds the word syf (סעיף, “paragraph”) before the 
abbreviation. Therefore, the paragraph number is the correct solution, i.e., 
11. There is no meaning to the second long form. 

4) Numerical Value - Gematria (GM): this method tries to conclude the proper 
disambiguation solution based on the numerical value of the abbreviation. In essence, 
every abbreviation has a numerical value solution. For example in the following 
Hebrew sentence: "a"aa DVan t"’D 1"D7 la’Oa laphaai" - the abbreviation rsg (l"D7) 
may be interpreted as (11K1 n’7yo ’37, “a Jewish Scholar”) or 273 (the numerical value 
of this abbreviation). This method will choose 273 as the disambiguation solution. 

5) Most Common Extension in the Text (CT): this method tries to determine the 
proper disambiguation solution based on the statistical frequency of each 
solution in the context of the text file. For example, in the following Hebrew 
sentence: 

"ania ’"K ’"V1 v'^ab K3’K KP’a K71D’K s"av" - the abbreviation ‘y (א"י) may be 
interpreted as (אינו יהודי, “non Jew”) or (ארץ ישראל, “land of Israel”). The 
most common long form in the context of the discussed text file is (אינו יהודי, 
“non Jew”), which is therefore the disambiguation solution. 

6) Most Common Extension in the Language (CL): this method tries to determine 
the proper disambiguation solution based on the statistical frequency of each 
solution in the Hebrew language. For example, in the following Hebrew sentence: 

"T’)>S7 nmn nrK it miVD !3"!37 S771" - the abbreviation mm (מ"מ) may be 
interpreted as (מכל מקום, “anyhow”) or (’DTitl “a name of a Jewish law book”) or 
(rnwi3 “another name of a Jewish law book”). The most common long form in the 
Hebrew language is (מכל מקום, “anyhow”), which is therefore the disambiguation 
solution. 
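
To make the baseline methods more concrete, the following sketch shows one possible reading of the context-word methods (WB, WA) and of the frequency methods (CT, CL). It is a simplified illustration under stated assumptions, not the prototype: the evidence table, the function names and the English glosses standing in for Hebrew long forms are all hypothetical.

```python
from collections import Counter


def context_word_method(context_word, candidates, evidence):
    """WB/WA sketch: choose the candidate seen most often next to
    context_word. `evidence` maps (context_word, long_form) to a count
    and is a hypothetical table built from annotated text."""
    best = max(candidates, key=lambda c: evidence.get((context_word, c), 0))
    # Abstain when the neighbouring word was never seen with any candidate.
    return best if evidence.get((context_word, best), 0) > 0 else None


def most_common_extension(candidates, observed_long_forms):
    """CT/CL sketch: choose the candidate that is most frequent in
    `observed_long_forms` (the current text for CT, a language-wide
    sample for CL)."""
    counts = Counter(observed_long_forms)
    return max(candidates, key=lambda c: counts[c])


candidates = ["non Jew", "land of Israel"]
evidence = {("performed by", "non Jew"): 5,
            ("performed by", "land of Israel"): 1}
print(context_word_method("performed by", candidates, evidence))
print(most_common_extension(candidates,
                            ["non Jew", "non Jew", "land of Israel"]))
```

Both functions return the earliest candidate on a tie, because `max` keeps the first maximal element it encounters.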



4 Experiments 

We have constructed a relatively small corpus, taken from [7]. This corpus 
discusses Jewish laws referring to Saturday. It contains about 19,000 Hebrew 
words, including about 1,500 abbreviations, of which 557 are ambiguous. Each 
abbreviation has, on average, 2.3 possible long forms. We used each of the 
above baseline methods independently to detect and disambiguate abbreviations 
that have more than two long forms. Table 1 shows the results of our six 
methods acting independently on ambiguous abbreviations, regarding correct 
disambiguation: count and percentage. 



Table 1. Summarization of the results for the methods acting independently 

#   Method   Correct Disambiguation 
             #      % 
1   WB       235    42.2 
2   WA       245    44 
3   PR       253    45.4 
4   GM       48     8.7 
5   CT       320    57.5 
6   CL       239    42.9 


The best baseline method, CT, achieves a rate of about 57.5%. This means that the 
statistical frequency of the long forms within the text file itself is the best 
single indicator. 

We also combined the above baseline methods to work concurrently. The number 
of combinations tested was 57: there are 15 ways to combine 2 methods, 20 ways 
to combine 3 methods, 15 ways to combine 4 methods, 6 ways to combine 5 methods, 
and 1 experiment used all 6 baseline methods at once. 
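
The count of 57 combinations can be checked with a short enumeration. The method labels follow the abbreviations used in Table 1; the helper name is an assumption of this sketch.

```python
from itertools import combinations

METHODS = ["WB", "WA", "PR", "GM", "CT", "CL"]

def method_combinations(methods, min_size=2):
    """Enumerate every subset of `methods` that has at least `min_size` members."""
    return [
        combo
        for size in range(min_size, len(methods) + 1)
        for combo in combinations(methods, size)
    ]

combos = method_combinations(METHODS)
# Number of combinations for each subset size (2 to 6 methods).
counts_by_size = {k: sum(1 for c in combos if len(c) == k) for k in range(2, 7)}
```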

In these experiments, each baseline method returned the abbreviation 
disambiguation solution best suited to the method. Each solution was given equal 
weight in the consideration of the final solution. When a decision was needed 
between 2 equally possible solutions, the first of the solutions to appear in the solution 
possibility set was chosen. Table 2 shows the results of our methods working in 
different combinations on ambiguous abbreviations, regarding correct 
disambiguation: count and percentage. 

Table 2. Summarization of the results for combinations of methods acting together 

#    Method                Correct Disambiguations
                           #        %
1    PR-WB-WA-CT           327      58.7
2    PR-WB-WA-CT-CL        324      58.1
3    PR-WB-WA-CT-CL-GM     322      57.9
4    CT-GM                 306      55.1
5    CL-CT                 295      53
6    PR-WB-CL              293      52.6
7    PR-CL                 286      51.3


The best combination of baseline methods, PR-WB-WA-CT, achieves a rate of 
58.7%. The CL method (the statistical frequencies known in the Hebrew 
language) and the GM method (the numerical value of the abbreviation) are the 
only methods that are not included in this combination. This may be 
supporting evidence that the most important methods for correct disambiguation are 
those based on internal-file information: the CT method and the first three 
methods, which consider one word before, one word after and prefixes. It may also 
support the claim that Gematria should not receive equal weight, since it can only 
carry a real meaning when the letters of the abbreviation are written in 
descending value, which does not hold for every kind of abbreviation. 



5 Hard Cases 

Analyzing the results of the above experiments showed us which hard cases we 
could not solve. A prominent characteristic of these cases was their dependence on 
the semantic context of the initialisms. Since our current methods are static, we 
could not correctly disambiguate this kind of initialism. 

An example of such a case is given with two subsequent sentences: 

Sentence A: ’7T’Dn nniK K"K riK Yi'^KW" (I questioned A”A about the trip tomorrow) 

Sentence B: '"7013 'jT’Dnw ’h K’h nnJV" (She answered me that the trip was canceled) 

Let us assume that the two most suitable solutions for the abbreviation ” (K"K, 
A” A) in Sentence A are extensions # 5 & 6 mentioned in the last paragraph of 
Section 2. That is, the potential solutions are: (1) (’3K ’3K, “my father’s father”) or (2) 
(’3K DK, “my father’s mother”). We can clearly conclude that the correct solution is 
the second one - “my father’s mother”, since the second sentence tells us that the 
questioned person in the first sentence is a female. This is clearly a semantic 
understanding of the sentences and it is based on multiple sentence diagnostics. 

Our research does not include this type of natural language understanding, but we 
may develop additional methods and their combinations, which will help solve many 
hard cases. 

We also propose a dynamic method that will be implemented and tested in the near 
future. This method uses static technical analysis to help indirectly solve some of these 
hard cases. After applying the above six static methods to all the initialisms in the 
text, we extract all the “sure thing” cases (ambiguous abbreviations whose 
disambiguation solution is 100% certain) based on these methods’ analysis. For each 
pair consisting of an initialism and a specific potential solution, we create a result 
file and add to it all the words of the sentence in which the initialism occurs 
(excluding ‘stop-words’). With each solution we enlarge our word collection file. After 
the entire file has been analyzed once, we iterate through all the “unsure” cases (2 or 
more disambiguation solutions are equally possible) in the text. We examine the sentence 
that contains the initialism and count the number of words from the sentence that 
appear in the word collection file of each solution. The solution with the 
highest number of common words in the examined sentence is selected. 
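
The proposed two-pass procedure can be sketched as follows. This is a minimal illustration under stated assumptions: sentences are already tokenized, the word collection "files" are kept as in-memory sets keyed by (initialism, solution), and all function names and the English sample tokens are hypothetical.

```python
from collections import defaultdict

def build_collections(sure_cases, stop_words):
    """First pass: collect context words for every (initialism, solution) pair.

    sure_cases: iterable of (initialism, solution, sentence_words) for the
        cases that the static methods resolved with full certainty.
    """
    collections = defaultdict(set)
    for abbr, solution, words in sure_cases:
        collections[(abbr, solution)].update(
            w for w in words if w not in stop_words and w != abbr
        )
    return collections

def resolve_unsure(abbr, candidates, sentence_words, collections, stop_words):
    """Second pass: pick the candidate whose collection shares most words with the sentence."""
    content = [w for w in sentence_words if w not in stop_words and w != abbr]

    def overlap(solution):
        known = collections.get((abbr, solution), set())
        return sum(1 for w in content if w in known)

    # max() keeps the first candidate on ties (an assumption of this sketch).
    return max(candidates, key=overlap)

stop = {"about", "the"}
sure = [
    ("A.A", "my father's mother", ["she", "answered", "A.A", "about", "trip"]),
    ("A.A", "my father's father", ["he", "answered", "A.A", "about", "work"]),
]
cols = build_collections(sure, stop)
picked = resolve_unsure(
    "A.A",
    ["my father's father", "my father's mother"],
    ["she", "asked", "A.A", "about", "the", "trip"],
    cols,
    stop,
)
```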

This method reflects the tendency of words in a language to recur in the 
vicinity of particular other words. It also adds a dynamic dimension to our research and may 
give different results for different texts based on the “sure thing” cases of the text. 



6 Conclusions, Summary and Future Work 

Our basic system is the first to disambiguate abbreviations for Hebrew text-files. It 
achieves a rate of about 57.5% using the best baseline method, CT. This means that 
the statistical frequencies within the context of the text file form the best 
independent method. 

The best combination of baseline methods achieves a rate of 58.7%. The 
methods in this combination are based on internal-file information: CT and the 
methods that consider one word before, one word after and prefixes. 

Future directions for research are: (1) finding new baseline methods and combining 
them with the previous baseline methods, (2) enlarging the range of text file types 
and further researching new abbreviations, (3) using learning techniques in order to find 
the best weighted combinations of methods and (4) elaborating the model for 
abbreviation disambiguation for various kinds of Hebrew documents. 

Appendix 

The Hebrew Transliteration Table presented below, which has been used in this 
paper, is taken from the web site of the Princeton University Library 
(http://infosharel.princeton.edu/katmandu/hebrew/trheb.html). 

Consonants 

Vernacular          Romanization
א                   ʼ (alif) or disregarded
בּ                   b
ב                   v (in Yiddish, b)
ג                   g
ד                   d
ה                   h
ו                   ṿ (only if a consonant)
וו                  ṿ (only if a consonant)
ז                   z
ח                   ḥ
ט                   ṭ
י                   y (only if a consonant)
כּ (final ךּ)        k
כ (final ך)         kh
ל                   l
מ (final ם)         m
נ (final ן)         n
ס                   s
ע                   ʻ (ayin)
פּ (final ףּ)        p
פ (final ף)         f
צ (final ץ)         ts
ק                   ḳ
ר                   r
שׁ                   sh
שׂ                   ś
תּ                   t
ת                   t (in Yiddish, s̀)

Vowels 

Hebrew              Romanization
◌ַ                   a
◌ָ                   a or o
◌ֶ                   e
◌ֵ                   e
◌ִ                   i
◌ֹ                   o
◌ֻ                   u
◌ֵי                  e
◌ֶי                  e
◌ִי                  i
וֹ                   o
וּ                   u
◌ְ                   e or disregarded
◌ֲ                   a
◌ֱ                   e
◌ֳ                   o

Abstract. In this paper, we re-visit the foundations of the statistical 
approach to machine translation and study two forms of the Bayes decision 
rule: the common rule for minimizing the number of string errors 
and a novel rule for minimizing the number of symbol errors. The Bayes 
decision rule for minimizing the number of string errors is widely used, 
but its justification is rarely questioned. 

We study the relationship between the Bayes decision rule, the underlying 
error measure, and word confidence measures for machine translation. 
The derived confidence measures are tested on the output of a 
state-of-the-art statistical machine translation system. Experimental 
comparison with existing confidence measures is presented on a translation task 
consisting of technical manuals. 



1 Introduction 

The statistical approach to machine translation (MT) has found widespread use. 
There are three ingredients to any statistical approach to MT, namely the Bayes 
decision rule, the probability models (trigram language model, HMM, ...) and 
the training criterion (maximum likelihood, mutual information, ...). 

The topic of this paper is to examine the differences between string error (or 
sentence error) and symbol error (or word error) and their implications for the 
Bayes decision rule. The error measure is referred to as loss function in statistical 
decision theory. We will present a closed representation of different word error 
measures for MT. For these different word error measures, we will derive the 
posterior risk. This will lead to the definition of several confidence measures at 
the word level for MT output. 

Related Work: For the task of MT, statistical approaches were proposed at the 
beginning of the nineties [3] and found widespread use in the last years [12, 14]. 
To the best of our knowledge, the ’standard’ version of the Bayes decision rule, 
which minimizes the number of sentence errors, is used in virtually all approaches 
to statistical machine translation (SMT). There are only a few research groups 
that do not take this type of decision rule for granted. 



J. L. Vicedo et al. (Eds.): EsTAL 2004, LNAI 3230, pp. 70-81, 2004. 
© Springer- Verlag Berlin Heidelberg 2004 




Bayes Decision Rules and Confidence Measures 






In [8], an approach to SMT was presented that minimized the posterior risk 
for different error measures. Rescoring was performed on 1,000-best lists pro- 
duced by an SMT system. In [11], a sort of error related or discriminative train- 
ing was used, but the decision rule as such was not affected. In other research 
areas, e.g. in speech recognition, there exist a few publications that consider the 
word error instead of the sentence error for taking decisions [6] . 

2 Bayes Decision Rule for Minimum Error Rate 

2.1 The Bayes Posterior Risk 

Knowing that any task in natural language processing (NLP) is a difficult one, 
we want to keep the number of wrong decisions as small as possible. To classify 
an observation vector $y$ into one out of several classes $c$, we resort to the so-called 
statistical decision theory and try to minimize the posterior risk $R(c \mid y)$ in taking 
a decision. The posterior risk is defined as 

$$R(c \mid y) = \sum_{\tilde{c}} Pr(\tilde{c} \mid y) \cdot L[c, \tilde{c}] ,$$

where $L[c, \tilde{c}]$ is the so-called loss function or error measure, i.e. the loss we incur 
in making decision $c$ when the true class is $\tilde{c}$. The resulting decision rule is known 
as the Bayes decision rule [4]: 

$$y \rightarrow \hat{c} = \arg\min_{c} R(c \mid y) = \arg\min_{c} \sum_{\tilde{c}} Pr(\tilde{c} \mid y) \cdot L[c, \tilde{c}] .$$

In the following, we will consider two specific forms of the error measure, 
$L[c, \tilde{c}]$. The first will be the measure for string errors, which is the typical loss 
function used in virtually all statistical approaches. The second is the measure 
for symbol errors, which is the more appropriate measure for machine translation 
and also speech recognition. 

In NLP tasks such as Part-of-Speech tagging, where we do not have the 
alignment problem, the optimal decision is the following: compute the Bayes 
posterior probability and accept if the probability is greater or equal to 0.5. We 
omit the proof here. Following this, we formulate the Bayes decision rule for two 
different word error measures in MT. From those we can derive word confidence 
measures for MT according to which the words in MT output can be either 
accepted as correct translations or rejected. 
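
As a toy illustration of why the choice of loss function matters, the following sketch applies the decision rule of Section 2.1 to a made-up posterior over three two-word strings. The posterior values, the function names and the position-wise (Hamming) loss used as a stand-in for a symbol error measure are assumptions for illustration only; the sketch sidesteps the alignment problem by using strings of equal length.

```python
def posterior_risk(decision, posterior, loss):
    """R(c|y): expected loss of `decision` under the class posterior."""
    return sum(p * loss(decision, true_class) for true_class, p in posterior.items())

def bayes_decision(posterior, loss):
    """Return the class with minimal posterior risk (first one on ties)."""
    return min(posterior, key=lambda c: posterior_risk(c, posterior, loss))

def zero_one_loss(c, c_true):
    """String error: 1 unless the whole string matches."""
    return 0.0 if c == c_true else 1.0

def hamming_loss(c, c_true):
    """Symbol error for equal-length strings: number of differing positions."""
    return float(sum(a != b for a, b in zip(c.split(), c_true.split())))

# Hypothetical posterior Pr(c|y) over three candidate strings.
posterior = {"a b": 0.4, "c b": 0.3, "c d": 0.3}

string_choice = bayes_decision(posterior, zero_one_loss)
symbol_choice = bayes_decision(posterior, hamming_loss)
```

Under the string-level loss the rule reduces to picking the most probable string, whereas under the symbol-level loss a less probable string can win because it shares more symbols with the competing hypotheses.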

2.2 Sentence Error 

For machine translation, the starting point is the observed sequence of words 
$y = f_1^J = f_1 \ldots f_J$, i.e. the sequence of words in the source language which has to 
be translated into a target language sequence $c = e_1^I = e_1 \ldots e_I$. 

The first error measure we consider is the sentence error: two target language 
sentences are considered to be identical only when the words in each position 
are identical (which naturally requires the same length $I$). In this case, the error 
measure between two strings $e_1^I$ and $\tilde{e}_1^{\tilde{I}}$ is: 

$$L[e_1^I, \tilde{e}_1^{\tilde{I}}] = 1 - \delta(I, \tilde{I}) \cdot \prod_{i=1}^{I} \delta(e_i, \tilde{e}_i) ,$$

with the Kronecker delta $\delta(\cdot,\cdot)$. In other words, the errors are counted at the 
string (or sentence) level and not at the level of single symbols (or words). 
Inserting this cost function into the Bayes risk (see Section 2.1), we obtain the 
following form of the Bayes decision rule: 

$$f_1^J \rightarrow (\hat{I}, \hat{e}_1^{\hat{I}}) = \arg\max_{I, e_1^I} \{ Pr(I, e_1^I \mid f_1^J) \} = \arg\max_{I, e_1^I} \{ Pr(I, e_1^I, f_1^J) \} . \qquad (1)$$

This is the starting point for virtually all statistical approaches in machine 
translation. However, this decision rule is only optimal when we consider sentence 
error. In practice, however, the empirical errors are counted at the word level. 
This inconsistency of decision rule and error measure is rarely addressed in the 
literature. 

2.3 Word Error 

Instead of the sentence error rate, we can also consider the error rate of symbols, 
or single words. In the MT research community, there exist several different error 
measures that are based on the word error. We will investigate the word error rate 
(WER) and the position independent word error rate (PER). 

The symbol sequences in Figure 1 illustrate the differences between the two 
error measures WER and PER: Comparing the strings ’ABCBD’ and ’ABBCE’, 
WER yields an error of 2, whereas the PER error is 1. 

            WER:     PER: 

string 1    ABCBD    ABCBD 

string 2    ABBCE    ABBCE 

Fig. 1. Example of the two symbol error measures WER and PER: The string ’ABCBD’ 
is compared to ’ABBCE’ 

For NLP tasks where there is no variance in the string length (such as Part-of- 
Speech tagging), the integration of the symbol error measure into Bayes decision 
rule yields that a maximization of the posterior probability for each position i has 
to be performed [10]. In machine translation, we need a method for accounting 
for differences in sentence length or word order between the two strings under 
consideration, e.g. the Levenshtein alignment (cf. WER). 
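
— WER (word error rate): The word error rate is based on the Levenshtein 
distance [9]. It is computed as the minimum number of substitution, 
insertion, and deletion operations that have to be performed to 
convert the generated sentence into the reference sentence. For two 
sentences $e_1^I$ and $\tilde{e}_1^{\tilde{I}}$, the Levenshtein alignment is denoted by 
$\mathcal{L}(e_1^I, \tilde{e}_1^{\tilde{I}})$; for a word $e_i$, the Levenshtein aligned word in 
$\tilde{e}_1^{\tilde{I}}$ is denoted by $\mathcal{L}_i(e_1^I, \tilde{e}_1^{\tilde{I}})$ for $i = 1, \ldots, I$. 

In order to keep the presentation simple, we only consider substitutions 
and deletions of words in $e_1^I$ and omit insertions in $\tilde{e}_1^{\tilde{I}}$. The error measure is 
defined by 

$$L[e_1^I, \tilde{e}_1^{\tilde{I}}] = \sum_{i=1}^{I} \left[ 1 - \delta\big(e_i, \mathcal{L}_i(e_1^I, \tilde{e}_1^{\tilde{I}})\big) \right] .$$

This yields the posterior risk 

$$R(I, e_1^I \mid f_1^J) = \sum_{\tilde{I}, \tilde{e}_1^{\tilde{I}}} Pr(\tilde{I}, \tilde{e}_1^{\tilde{I}} \mid f_1^J) \cdot \sum_{i=1}^{I} \left[ 1 - \delta\big(e_i, \mathcal{L}_i(e_1^I, \tilde{e}_1^{\tilde{I}})\big) \right]$$

$$= \sum_{i=1}^{I} \sum_{\tilde{I}, \tilde{e}_1^{\tilde{I}}} \left[ 1 - \delta\big(e_i, \mathcal{L}_i(e_1^I, \tilde{e}_1^{\tilde{I}})\big) \right] \cdot Pr(\tilde{I}, \tilde{e}_1^{\tilde{I}} \mid f_1^J) .$$

In Section 3.2 we will see that this is related to the word posterior 
probabilities introduced in [15]. The Bayes decision rule 
is obtained by minimizing the risk. 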

— PER (position independent word error rate): A shortcoming of the WER 
is the fact that it does not allow for movement of words or blocks. The 
word order of two target sentences can be different even though they are 
both correct translations. In order to overcome this problem, the position 
independent word error rate compares the words in the two sentences 
without taking the word order into account. Words that have no matching 
counterparts are counted as substitution errors, missing words are deletion 
and additional words are insertion errors. The PER is always lower than or 
equal to the WER. 

To obtain a closed-form solution of the PER, we consider for each word 
$e = 1 \ldots E$ in the target vocabulary the number $n_e$ of occurrences in sentence 
$e_1^I$, i.e. $n_e = \sum_{i=1}^{I} \delta(e_i, e)$. The number of occurrences of word $e$ in sentence 
$\tilde{e}_1^{\tilde{I}}$ is denoted by $\tilde{n}_e$, respectively. The error can then be expressed as 

$$L[e_1^I, \tilde{e}_1^{\tilde{I}}] = \max(I, \tilde{I}) - \sum_{e} \min(n_e, \tilde{n}_e) .$$

Thus, the error measure depends only on the two sets of counts 
$n_1^E := n_1 \ldots n_e \ldots n_E$ and $\tilde{n}_1^E := \tilde{n}_1 \ldots \tilde{n}_e \ldots \tilde{n}_E$. The integration of this 
error measure into the posterior risk yields [16] 

$$R(I, e_1^I \mid f_1^J) = \frac{1}{2} \sum_{e} \sum_{\tilde{n}_e} |n_e - \tilde{n}_e| \cdot Pr_e(\tilde{n}_e \mid f_1^J) + \frac{1}{2} \sum_{\tilde{I}} |I - \tilde{I}| \cdot Pr(\tilde{I} \mid f_1^J) \qquad (2)$$

where $Pr_e(n_e \mid f_1^J)$ is the posterior probability of the count $n_e$ of word $e$. 
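
The step from the max/min form of the PER to the absolute values in Eq. 2 rests on the identities $\max(a,b) = \frac{1}{2}(a + b + |a-b|)$ and $\min(a,b) = \frac{1}{2}(a + b - |a-b|)$ together with $\sum_e n_e = I$, which give $L = \frac{1}{2}|I - \tilde{I}| + \frac{1}{2}\sum_e |n_e - \tilde{n}_e|$. The sketch below computes the PER in both forms and a plain Levenshtein-based WER (with insertions, i.e. the standard distance rather than the simplified measure of the text); the function names and the sample sentences are assumptions for illustration.

```python
from collections import Counter

def per(hyp, ref):
    """PER as max(I, I~) minus the number of position-independent matches."""
    n, n_ref = Counter(hyp), Counter(ref)
    return max(len(hyp), len(ref)) - sum(min(n[w], n_ref[w]) for w in n)

def per_abs_form(hyp, ref):
    """Same quantity as (|I - I~| + sum_e |n_e - n~_e|) / 2."""
    n, n_ref = Counter(hyp), Counter(ref)
    vocab = set(n) | set(n_ref)
    twice = abs(len(hyp) - len(ref)) + sum(abs(n[w] - n_ref[w]) for w in vocab)
    return twice // 2  # the sum is always even, so this division is exact

def wer(hyp, ref):
    """Levenshtein distance between two token sequences (row-wise dynamic programming)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (h != r)))
        prev = cur
    return prev[-1]

hyp = "the printer is not ready".split()
ref = "printer the is ready".split()
per_value = per(hyp, ref)
wer_value = wer(hyp, ref)
```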



3 Confidence Measures for Machine Translation 

3.1 Introduction 

In many applications of machine translation, a method for labeling the generated 
words as either correct or incorrect is needed. To this purpose, each word in 
the generated target sentence is assigned a so-called confidence measure. This 
confidence measure can be used e.g. in interactive systems to report possible 
errors to the user or to propose translations only when they are likely to be 
correct. 

Confidence measures have been extensively studied for speech recognition, 
but are not well known in other areas. Only recently have researchers started to 
investigate confidence measures for machine translation [1, 2, 5, 15]. 

We apply word confidence measures in MT as follows: For a given translation 
produced by an MT system, we calculate the confidence of each generated word 
and compare it to a threshold. All words whose confidence is above this threshold 
are tagged as correct and all others are tagged as incorrect translations. As stated 
before, this approach is related to the minimization of the expected number of 
word errors instead of sentence errors. 

In this section, we will briefly review some of the word confidence measures 
that have proven most effective, and show their connection with the Bayes risk 
as derived in Section 2.1. In addition, we will introduce new confidence measures 
and give an experimental comparison of the different methods. 

3.2 Word Posterior Probabilities 

In [15], different variants of word posterior probabilities which are applied as 
word confidence measures are proposed. We study three types of confidence 
measures: 

Target Position: One of the approaches to word posterior probabilities 
presented in [15] can be stated as follows: the posterior probability 
$p_i(e \mid f_1^J, I, e_1^I \setminus e_i)$ expresses the probability that the target word $e$ occurs 
in position $i$ (given the other words in the target sentence, $e_1^I \setminus e_i$). In Section 2.3, 
we saw that the (modified) word error measure WER directly leads to this word 
posterior probability. Thus, we study this word confidence measure here. 

The word posterior probability can be calculated over an N-best list of 
alternative translations that is generated by an SMT system. We determine all 
sentences that contain the word $e$ in position $i$ (or a target position Levenshtein 
aligned to $i$) and sum their probabilities, i.e. 

$$p_i(e \mid f_1^J, I, e_1^I \setminus e_i) = \frac{p_i(e, f_1^J, I, e_1^I \setminus e_i)}{\sum_{e'} p_i(e', f_1^J, I, e_1^I \setminus e_i)}$$

where 

$$p_i(e, f_1^J, I, e_1^I \setminus e_i) = \sum_{\tilde{I}, \tilde{e}_1^{\tilde{I}} :\ \mathcal{L}_i(e_1^I, \tilde{e}_1^{\tilde{I}}) = e} p(\tilde{I}, \tilde{e}_1^{\tilde{I}}, f_1^J) . \qquad (3)$$

This probability depends on the target words $e_1^I \setminus e_i$ in the generated string, 
because it is based on the Levenshtein alignment $\mathcal{L}_i(e_1^I, \tilde{e}_1^{\tilde{I}})$. 



Average Target Position: Due to the reordering of words which takes place 
in translation, the same target word may appear in different positions in the 
generated translations. The word posterior probabilities based on target 
positions presented above partially compensate for this effect by determining the 
Levenshtein alignment over the N-best list. Nevertheless, this cannot handle all 
reordering within the sentence that may occur. Therefore, we also introduce a 
new version of word posterior probabilities that determines the average over all 
posterior probabilities based on target positions: 

$$p_{avg}(e \mid f_1^J) = \frac{p_{avg}(e, f_1^J)}{\sum_{e'} p_{avg}(e', f_1^J)} ,$$

$$p_{avg}(e, f_1^J) = \frac{1}{I^*} \sum_{i=1}^{I^*} \ \sum_{\tilde{I} \geq i,\ \tilde{e}_1^{\tilde{I}} :\ \tilde{e}_i = e} p(\tilde{I}, \tilde{e}_1^{\tilde{I}}, f_1^J) \qquad (4)$$

where $I^*$ is the maximum of all generated sentence lengths. The idea is to 
determine the probability of word $e$ occurring in a generated sentence at all, without 
regarding a fixed target position. Note that here no Levenshtein alignment is 
performed, because the variation in sentence positions is accounted for through 
the computation of the arithmetic mean. 



Word Count: In addition to the word posterior probabilities described above, 
we also implemented a new variant that can be derived from Eq. 2 (Sec. 2.3), 
taking the counts of the words in the generated sentence into account, i.e. we 
determine the probability of target word $e$ occurring in the sentence $n_e$ 
times: 

$$p_e(n_e \mid f_1^J) = \frac{p_e(n_e, f_1^J)}{\sum_{n'_e} p_e(n'_e, f_1^J)}$$

where 

$$p_e(n_e, f_1^J) = \sum_{\tilde{I}, \tilde{e}_1^{\tilde{I}} :\ \tilde{n}_e = n_e} p(\tilde{I}, \tilde{e}_1^{\tilde{I}}, f_1^J) . \qquad (5)$$



Implementation: As already stated above, the word posterior probabilities 
can be calculated over N-best lists generated by an SMT system. Thus, the sum 
over all possible target sentences $\tilde{e}_1^{\tilde{I}}$ is carried out over the alternatives contained 
in the N-best list. If the list is long enough, this approximation is not harmful. 
In our experiments, we used 1,000-best lists.¹ 

Since the true probability distribution $Pr(I, e_1^I, f_1^J)$ is unknown, we replace 
it by a model distribution $p(I, e_1^I, f_1^J)$. This model distribution is the one from 
the SMT baseline system (see Section 4.2). 
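
The N-best approximation can be sketched for two of the measures above, the word count posterior (Eq. 5) and the averaged target position posterior (Eq. 4). The N-best list is represented here as (token list, probability) pairs; this representation, the function names and the toy probabilities are assumptions for illustration, and the target position measure of Eq. 3 is omitted because it additionally needs a Levenshtein alignment.

```python
from collections import defaultdict

def word_count_posterior(nbest, word, count):
    """Eq. 5 over an N-best list: normalized mass of hypotheses containing
    `word` exactly `count` times."""
    mass = defaultdict(float)
    for words, prob in nbest:
        mass[words.count(word)] += prob
    total = sum(mass.values())
    return mass.get(count, 0.0) / total if total > 0 else 0.0

def average_position_posterior(nbest, word):
    """Eq. 4 over an N-best list: mass of `word` averaged over all positions
    up to the maximal hypothesis length, normalized over all words."""
    if not nbest:
        return 0.0
    max_len = max(len(words) for words, _ in nbest)
    joint = defaultdict(float)
    for words, prob in nbest:
        for w in words:  # every position holding w contributes prob / I*
            joint[w] += prob / max_len
    total = sum(joint.values())
    return joint.get(word, 0.0) / total if total > 0 else 0.0

# Hypothetical 3-best list with model probabilities.
nbest = [
    (["the", "printer", "is", "ready"], 0.5),
    (["printer", "is", "ready"], 0.3),
    (["the", "printer", "ready"], 0.2),
]
p_the_once = word_count_posterior(nbest, "the", 1)
p_is_absent = word_count_posterior(nbest, "is", 0)
p_avg_the = average_position_posterior(nbest, "the")
```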



3.3 IBM-1 

We implemented another type of confidence measure that determines the 
translation probability of the target word $e$ averaged over the source sentence words 
according to Model 1 introduced by IBM in [3]. We determine the probability 
according to the formula² 

$$p_{IBM-1}(e \mid f_1^J) = \frac{p_{IBM-1}(e, f_1^J)}{\sum_{e'} p_{IBM-1}(e', f_1^J)} ,$$

$$p_{IBM-1}(e, f_1^J) = \frac{1}{J+1} \sum_{j=0}^{J} p(e \mid f_j) \qquad (6)$$

where $f_0$ is the ’empty’ source word [3]. The probabilities $p(e \mid f_j)$ are word based 
lexicon probabilities, i.e. they express the probability that $e$ is a translation of 
the source word $f_j$. 
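
A minimal sketch of the IBM-1 based confidence measure of Eq. 6 follows. The lexicon is represented as a dictionary from (target word, source word) pairs to probabilities, and the empty source word is written as "NULL"; both conventions, the function names and the toy lexicon values are assumptions for illustration.

```python
def ibm1_score(word, source, lexicon, empty="NULL"):
    """Unnormalized Eq. 6: lexicon probability of `word` averaged over the
    source words plus the empty word (J + 1 positions)."""
    positions = [empty] + list(source)
    return sum(lexicon.get((word, f), 0.0) for f in positions) / len(positions)

def ibm1_confidence(word, source, lexicon, target_vocab):
    """Eq. 6 normalized over all target words in `target_vocab`."""
    denom = sum(ibm1_score(e, source, lexicon) for e in target_vocab)
    return ibm1_score(word, source, lexicon) / denom if denom > 0 else 0.0

# Hypothetical lexicon probabilities p(e | f).
lexicon = {
    ("printer", "imprimante"): 0.8,
    ("the", "l'"): 0.6,
    ("the", "NULL"): 0.1,
    ("ready", "prete"): 0.7,
}
source = ["l'", "imprimante"]
vocab = ["the", "printer", "ready"]
conf_printer = ibm1_confidence("printer", source, lexicon, vocab)
```

Because only the source sentence and one target word enter the computation, this measure needs no N-best list.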

Investigations on the use of the IBM-1 model for word confidence measures 
showed promising results [1,2]. Thus, we apply this method here in order to 
compare it to the other types of confidence measures. 



4 Results 

4.1 Task and Corpus 

The experiments are performed on a French-English corpus consisting of techni- 
cal manuals of devices such as printers. This corpus is compiled within the Eu- 
ropean project TransType2 [13] which aims at developing computer aided MT 
systems that apply statistical techniques. For the corpus statistics see Table 1. 

4.2 Experimental Setting 

As basis for our experiments, we created 1,000-best lists of alternative trans- 
lations using a state-of-the-art SMT system. The system we applied is the so- 
called alignment template system as described in [12]. The key elements of this 
approach are pairs of source and target language phrases together with an align- 
ment between the words within the phrases. 



¹ In the area of speech recognition, much shorter lists are used [7]. The justification is 
that the probabilities of the hypotheses which are lower in the list are so small that 
they do not have any effect on the calculation of the word posterior probabilities. 
Nevertheless, we use longer N-best lists here to be on the safe side. 

² Note that this probability is different from the one calculated in [1]; it is normalized 
over all target words e'. Nevertheless, both measures perform similarly well. 

Table 1. Statistics of the training, development and test set 

                                           French      English
Training   Sentences                             52 844
           Words + Punctuation Marks      691 983      633 770
           Vocabulary Size                 14 831       13 201
           Singletons                       4 257        3 592
Develop    Sentences                                994
           Words                           11 731       10 903
Test       Sentences                                984
           Words                           11 800       11 177



This system, like virtually all state-of-the-art SMT systems, applies the 
Bayes decision rule in Eq. 1, i.e. it takes the decision based on sentence error. 

4.3 Word Error Measures 

It is not intuitively clear how to classify words in MT output as correct or 
incorrect when comparing the translation to one or several references. In the 
experiments presented here, we applied WER and PER for determining which 
words in a translation hypothesis are correct. Thus, we can study the effect of 
the word posterior probabilities derived from the error measures in Section 2.3 
on the error measures they are derived from. 

These error measures behave significantly differently with regard to the 
percentage of words that are labeled as correct. WER is more pessimistic than PER 
and labels 58% of the words in the develop and test corpus as correct, whereas 
PER labels 66% as correct. 

4.4 Evaluation Metrics 

After computing the confidence measure, each generated word is tagged as 
either correct or incorrect, depending on whether its confidence exceeds the tagging 
threshold that has been optimized on the development set beforehand. The 
performance of the confidence measure is evaluated using three different metrics: 

— CAR (Confidence Accuracy Rate): The CAR is defined as the number of 
correctly assigned tags divided by the total number of generated words in the 
translation. The baseline CAR is given by the number of correct words in the 
generated translation, divided by the number of generated words. The CAR 
strongly depends on the tagging threshold. Therefore, the tagging threshold 
is adjusted on a development corpus distinct from the test set. 

— ROC (Receiver Operating Characteristic curve) [4]: The ROC curve plots 
the correct rejection rate versus the correct acceptance rate for different values of 
the tagging threshold. The correct rejection rate is the number of incorrectly 
translated words that have been tagged as wrong, divided by the total 
number of incorrectly translated words. The correct acceptance rate is the ratio 
of correctly translated words that have been tagged as correct. These two 
rates depend on each other: If one of them is restricted by a lower bound, the 
other one cannot be restricted. The further the ROC curve lies away from 
the diagonal, the better the performance of the confidence measure. 

— AROC (Area under ROC curve): This value specifies twice the size of the 
area between the ROC curve and the diagonal; it ranges from 0 to 1. The 
higher this value, the better the confidence measure discriminates. 
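
The three metrics above can be sketched as follows. The sketch assumes that each generated word comes with a confidence value and a gold label (True for a correct word), that both correct and incorrect words occur in the data, and that the ROC area is approximated by the trapezoidal rule over all observed thresholds; the function names and sample values are illustrative.

```python
def car(confidences, gold, threshold):
    """Confidence accuracy rate: share of words whose tag matches the gold label."""
    tags = [c > threshold for c in confidences]
    return sum(t == g for t, g in zip(tags, gold)) / len(gold)

def baseline_car(gold):
    """CAR obtained by tagging every generated word as correct."""
    return sum(gold) / len(gold)

def rates(confidences, gold, threshold):
    """Return (correct acceptance rate, correct rejection rate) for one threshold."""
    accepted = [c > threshold for c in confidences]
    n_correct = sum(gold)
    n_incorrect = len(gold) - n_correct
    ca = sum(a and g for a, g in zip(accepted, gold)) / n_correct
    cr = sum((not a) and (not g) for a, g in zip(accepted, gold)) / n_incorrect
    return ca, cr

def aroc(confidences, gold):
    """Twice the area between the ROC curve and the diagonal (trapezoidal rule)."""
    # Descending thresholds trace the curve from (0, 1) to (1, 0).
    thresholds = sorted(set(confidences), reverse=True) + [float("-inf")]
    points = [rates(confidences, gold, t) for t in thresholds]
    area = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    return 2 * (area - 0.5)

conf = [0.9, 0.8, 0.2, 0.1]
separable = [True, True, False, False]
mixed = [True, False, False, True]
```

With perfectly separable confidences the AROC reaches its maximum of 1, while a measure that ranks words no better than chance yields a value near 0.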



4.5 Experimental Results 

We studied the performance of the word confidence measures described in Sec- 
tions 3.2 and 3.3. The results are given in Tables 2 and 3. For both the word 
error measure PER as well as WER, the best performance in terms of CAR is 
achieved by the IBM-1 based word confidence measure. 



Table 2. CAR [%] and AROC [%] on the test corpus. The error counting is based on 
PER (see Sec. 4.3) 

                                     CAR     AROC
Baseline                             64.4    -
averaged target position (Eq. 4)     64.8     6.6
target position (Eq. 3)              67.2    28.3
word counts (Eq. 5)                  68.1    29.8
IBM-1 (Eq. 6)                        71.6    21.5



Table 3. CAR [%] and AROC [%] on the test corpus. The error counting is based on 
WER (see Sec. 4.3) 

                                     CAR     AROC
Baseline                             55.9    -
averaged target position (Eq. 4)     59.3     9.1
target position (Eq. 3)              64.1    26.4
word counts (Eq. 5)                  62.0    20.8
IBM-1 (Eq. 6)                        66.3    18.7



When comparing the CAR for the word posterior probabilities given in 
Eqs. 3, 4, 5 and the IBM-1 based confidence measure, it is surprising that the 
latter performs significantly better (with regard to both word error measures 
WER and PER). The IBM-1 model is a very simple translation model which 
does not produce high quality translations when applied in translation. Thus it 
was interesting to see that it discriminates so well between good and bad 
translations. Moreover, this method relies only on one target hypothesis (and the 
source sentence), whereas the word posterior probabilities take the whole space 
of possible translations (represented by the N-best list) into account. 

In contrast to the good performance in CAR, the IBM-1 based confidence 
measure yields a much lower AROC value than two of the other measures. 
Looking at the ROC curve³ in Figure 2 we find the reason for this: there is a small 
area on the left part of the curve where the IBM-1 model based confidence 
measure actually discriminates better than all of the other confidence measures. 
Nevertheless, the overall performance of the word posterior probabilities based 
on target positions and of those based on the word count is better than that 
of the IBM-1 based confidence measure. 

We assume that the better performance of the IBM-1 based confidence 
measure is due to the fact that the involved lexicon probabilities do not depend on 
the specific N-best lists, but on a translation model that is trained on the whole 
training corpus. Thus, they are more exact and do not rely on an approximation 
as introduced by the Levenshtein alignment (cf. Eq. 3). Moreover, the tagging 
threshold can be estimated very reliably, because it will be the same for the 
develop and the test corpus. In order to verify this assumption, we analyzed the 
CAR on the develop corpus. This is used for the optimization of the tagging 
threshold. Thus, if the other measures indeed discriminate as well as or better 
than the IBM-1 based confidence measure, their CAR should be higher for the 
optimized threshold. Table 4 presents these experiments. We see that indeed the 
word posterior probabilities based on target positions and those based on word 
counts have a high accuracy rate and show a performance similar to (or better 
than) that of the IBM-1 based confidence measure. 



Table 4. Comparative experiment: CAR [%] on the develop set (threshold optimized). 
Results for error counting based on PER and WER (see Sec. 4.3) 

word error measure                   WER     PER
Baseline                             60.5    67.1
averaged target position (Eq. 4)     62.1    67.5
word counts (Eq. 5)                  67.3    72.1
target position (Eq. 3)              69.0    71.3
IBM-1 (Eq. 6)                        67.8    72.5



5 Outlook 

We saw in the derivation from Bayes risk that word posterior probabilities are 
closely related to word error rate minimization. Moreover, we found that they 
show a state-of-the-art performance as confidence measures on the word level. 
Therefore, we plan to apply them directly in the machine translation process 
and study their impact on translation quality. One possible approach would be 
to combine them with the sentence error based decision rule that is widely used 
for rescoring N-best lists. 

³ We only present the ROC curve for the word error measure PER here. The curve 
for WER looks very similar. 

Fig. 2. ROC curve on the test set. The error counting is based on PER (see Sec. 4.3) 

6 Conclusion 

In this work, we have taken first steps towards studying the relationship be- 
tween Bayes decision rule and confidence measures. We have presented two forms 
of Bayes decision rule for statistical machine translation: the well-known and 
widely-applied rule for minimizing sentence error, and one novel approach that 
aims at minimizing word error. We have investigated the relation between two 
different word error measures and word confidence measures for SMT that can 
be directly derived from Bayes risk. 

This approach led to a theoretical motivation for the target position based 
confidence measures as proposed in [15]. In addition, we derived new confidence 
measures that reduced the baseline error in discriminating between correct and 
incorrect words in MT output by a quarter. Other studies report similar reduc- 
tions for Chinese-English translation [1,2]. 

Abstract. The automatic identification of direct and indirect discourse and the 
association of each “direct” utterance with its author are research topics that are 
beginning to be explored in Natural Language Processing. 

We developed the DID system which, when applied to children's stories, starts by 
classifying the utterances that belong to the narrator (indirect discourse) and those 
belonging to the characters taking part in the story (direct discourse). Afterwards, 
DID tries to associate each direct discourse utterance with the character(s) in the 
story. 

This automation can be advantageous, namely when it is necessary to tag the 
stories that should be handled by an automatic story teller. 



1 Introduction 

Children's stories have some intrinsic magic that captures the attention of any reader. This
magic is transmitted by the intervening characters and by the narrator, who contributes to the
comprehension and emphasis of the fables. Inherent to this theme is the human reader's
grasp of direct and indirect discourse, which correspond to the characters
and the narrator, respectively.

The automatic identification of speakers in children’s stories is a necessary step for 
the comprehension of the story, namely when it is necessary to tag the stories that should 
be handled by an automatic story teller. After knowing which portions of the story should 
be read by which speaker, it is possible to choose the appropriate voices for synthesizing 
the story characters [5], to choose the appropriate character representation and animate 
it in a story teller [1]. 

This work deals with the identification of the character (the narrator may be consid- 
ered another character, for this purpose) that is responsible for each story utterance. The 
result is expressed in a final document with tags associated with each character. 

For example, consider the following excerpt of a story*:

They arrived at the lake. The boy waved to them, smiling. 

Come, it is really good! 



* Although some examples are in English, our system only handles Portuguese texts.

J. L. Vicedo et al. (Eds.): EsTAL 2004, LNAI 3230, pp. 82-90, 2004. 

© Springer-Verlag Berlin Heidelberg 2004
Our system identifies the text associated with each character of the story:

<person name=narrator>
They arrived at the lake. The boy waved to them, smiling.
</person>

<person name=boy>
Come, it is really good!
</person>

Our approach consists of two basic stages: (i) identification of the utterances that
belong to the narrator and the utterances that are said by a character, and (ii) association
of each character utterance with a specific story character. The first stage is described in
Section 2 and the second stage is described in Section 3.

2 Direct/Indirect Discourse Identification 

2.1 Pre-processing 

In order to apply DID it is necessary to resort to Smorph [6], a morphological analyzer, 
and PasMo [4], which divides the text into paragraphs and transforms word tags. Thus, 
the story texts are first submitted to Smorph and then PasMo, which produces XML 
documents. 

2.2 Solution 

We started by collecting a set of children's stories, all of them written by Portuguese
authors. This corpus was divided into a training set (the first eleven stories of Table 1)
and a test set (the last four stories of the same table). After inspecting the training
set by hand, we extracted twelve heuristics:

H1 A dash at the beginning of a paragraph identifies a direct discourse;

H2 A paragraph mark after a colon suggests the paragraph corresponds to a character 
(direct discourse); 

H3 If a paragraph has a question mark at the end then it probably belongs to a character;

H4 The exclamation mark at the end of a paragraph identifies a direct discourse, with
some probability. This heuristic follows the reasoning of H3;

H5 The personal or possessive pronouns in the 1st or 2nd person indicate that we are 
in the presence of a direct discourse; 

H6 Verbs in past tense, present, future or imperfect tense are characteristics of direct 
discourse because they are verbs directed to characters; 

H7 The usage of inverted commas can indicate the speech of a character, but gener- 
ally it is the narrator imitating the character and not the character speaking about 
himself/herself; 

H8 The usage of tense adverbs (tomorrow, today, yesterday, etc.) can identify a direct 
discourse; 

H9 If next to a direct discourse there is a short text between dashes, then the next text 
excerpt probably belongs to a character; 
H10 The imperfect tense verbs that can be expressed in the same way for a character
and for a narrator only lead to a direct discourse when there is a personal pronoun
corresponding to a character;

H11 In the phrase, if there is a text excerpt between two dashes where a declarative verb
exists (declare, say, ask, etc.) in the third person, then we can say that a character
expresses the text excerpt appearing before the left dash;

H12 The use of interjections identifies a direct discourse because only characters use 
them. 

The input of DID is PasMo’s output. DID analyses the text paragraph by paragraph,
applying the heuristics to each one. After processing the whole text, DID returns
an XML document, in VHML format [2], that contains all the identified discourses
according to the tags supported by this language.

DID follows the Theory of Confirmation to compute the degree of trust with which
a direct discourse is identified: the user can define the trust to associate with each heuristic
and also the value of the threshold, which defines the limit between success and failure.
Thus, we can say that DID works like an expert system.
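
The text does not say how the trust values of several heuristics are combined. The following is a minimal sketch only, assuming a MYCIN-style certainty-factor combination (one common formulation of confirmation theory); the trust values, the threshold and all function names are hypothetical, and only simplified versions of H1, H3 and H4 are shown.

```python
from functools import reduce

# Hypothetical trust values; in DID these are defined by the user.
TRUST = {"H1": 0.9, "H3": 0.6, "H4": 0.5}
THRESHOLD = 0.7  # hypothetical limit between success and failure


def fired_heuristics(paragraph):
    """Return the (simplified) heuristics that fire on one paragraph."""
    text = paragraph.strip()
    fired = []
    if text.startswith(("-", "\u2013", "\u2014")):  # H1: dash at the beginning
        fired.append("H1")
    if text.endswith("?"):  # H3: question mark at the end
        fired.append("H3")
    if text.endswith("!"):  # H4: exclamation mark at the end
        fired.append("H4")
    return fired


def combine(cf_a, cf_b):
    """Combine two positive certainty factors (assumed MYCIN-style rule)."""
    return cf_a + cf_b * (1.0 - cf_a)


def direct_discourse_trust(paragraph):
    """Accumulated trust that the paragraph is a direct discourse."""
    factors = [TRUST[name] for name in fired_heuristics(paragraph)]
    return reduce(combine, factors, 0.0)


def is_direct(paragraph):
    """Classify the paragraph by comparing its trust with the threshold."""
    return direct_discourse_trust(paragraph) >= THRESHOLD
```

With these illustrative values, a paragraph that only ends in a question mark stays below the threshold, while a leading dash alone is enough to classify it as direct discourse.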

However, DID's first results led us to refine these heuristics, namely:

- H3 and H4 have different trust values depending on the position of the mark in the
paragraph. If there is a question or exclamation mark in the middle of a paragraph,
the trust value is lower. When the mark is at the end of the paragraph, the trust value
is higher. So, these heuristics have two trust values.

- H5 and H6 have been combined, because DID’s input has many ambiguities.

- H7 proved to be a neutral heuristic, so it was removed.

Final results can be seen in Table 1.



Table 1. Performance of direct/indirect discourse separation

Story                                                Correct  Incorrect  Success rate
1 - O Gato das Botas                                      28          0          100%
2 - O Macaco do Rabo Cortado                              48          0          100%
3 - O Capuchinho Vermelho                                 41          1           97%
4 - Os Três Porquinhos                                    28          1           96%
5 - Lisboa 2050                                          147          6           96%
6 - A Branca de Neve                                      43          2           95%
7 - Ideias do Canário                                     41          2           95%
8 - Anita no Hospital                                    102         11           90%
9 - Os Cinco e as Passagens Secretas                     131         19           87%
10 - A Bela e o Monstro                                   31          6           83%
11 - O Bando dos Quatro: A Torre Maldita (Chap. 1)        70         40           63%
12 - Pinóquio                                             43          1           97%
13 - O estratagema do amor                               147         11           93%
14 - O rei                                                81          9           90%
15 - Aduzinda e Zulmiro a magia da adolescência           95         12           88%
2.3 Evaluation

In order to check the capabilities of the DID system, we developed a new system, DID-Verify,
which compares DID’s output with the idealized result. This comparison verifies
whether discourses were correctly identified by DID and also
shows the number of times that each heuristic was applied.

After analyzing the results obtained with the training set, we can infer that the
best results are obtained for the children's stories (e.g. O Gato das Botas, O Macaco do
Rabo Cortado), which can be explained by the fact that character utterances are mainly
identified by Heuristic 1. The worst result is obtained with the story O Bando dos Quatro,
because here the narrator is also a character of the story, leading to an ambiguous agent
that sometimes speaks like a narrator and sometimes like a character. DID is not prepared
to treat this ambiguity. Two children's stories achieved a 100% success rate, confirming the good
performance of DID as a tagger for a Story Teller System under development by other
researchers of our research institute. The result obtained for the story Lisboa 2050 should
be highlighted, because this story has a large number of discourses and DID achieves
a 96% success rate. Overall, DID obtains an average success rate of 89%,
which is in line with the projected objectives.

On the test set, all results exceed an 80% success rate, with an average of 92%.
This is very reasonable for a set of texts that was not used to train the DID system. This
result also shows that DID performs well on different types of stories.
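
The reported averages of 89% and 92% appear to be pooled over all utterances rather than averaged over stories. That reading is an inference from the counts in Table 1, not something the text states; the short check below, with a hypothetical function name, reproduces both figures under that assumption.

```python
# (correct, incorrect) counts per story, copied from Table 1.
TRAINING = [(28, 0), (48, 0), (41, 1), (28, 1), (147, 6), (43, 2),
            (41, 2), (102, 11), (131, 19), (31, 6), (70, 40)]
TEST = [(43, 1), (147, 11), (81, 9), (95, 12)]


def pooled_success_rate(counts):
    """Success rate in percent over all utterances of a set of stories."""
    correct = sum(c for c, _ in counts)
    total = sum(c + i for c, i in counts)
    return 100.0 * correct / total


training_rate = pooled_success_rate(TRAINING)  # 710 correct out of 798
test_rate = pooled_success_rate(TEST)          # 366 correct out of 399
```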

Examining the results obtained by DID-Verify with the test set, we obtained Table 2,
which shows the performance of each heuristic. Here we conclude that Heuristic 1 is
the most applied, identifying the largest number of discourses correctly. Heuristic 5 and
Heuristic 6 also lead to good results. Heuristic 2 never fails but was only applied six
times. Heuristic 4 is the one that leads to the most mistakes, because the exclamation mark is
often used in narration. Generally, all the heuristics have a high success
rate.



Table 2. Analysis of the correctness of each heuristic

Heuristic  No. of successes  No. of failures  Success rate
H1                      188                2         98.9%
H2                        6                0          100%
H3                       59                1         98.3%
H4                       37                3         92.5%
H5                       81                2         97.6%
H6                       70                1         98.6%
H8                        7                1         87.5%
H12                      17                1         94.4%



3 Character Identification 

3.1 VHML Changes 

Sometimes it is not clear, even for humans, which character must be associated
with a given utterance. To allow the representation of this kind of ambiguity, and to avoid
the repetition of utterances whenever an utterance is said simultaneously by multiple
characters, we made small changes to the VHML language. Namely, we introduced the
concept of speaker:

<!ELEMENT vhml (paragraph | p | person | references | speaker | mark)+>
<!ELEMENT speaker ((personname | colective)+, person)>
<!ELEMENT colective (personname, personname+)>
<!ELEMENT personname (#PCDATA)>

The following example represents an ambiguity: “Text...” must be associated either
with a group of characters (Character1, Character2 and Character3) or with
Character4:

<speaker>
  <personname>Group1</personname>
  <colective>
    <personname>Character1</personname>
    <personname>Character2</personname>
    <personname>Character3</personname>
  </colective>
  <personname>Character4</personname>
  <person>
    <p>Text...</p>
  </person>
</speaker>
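
As an illustration of how a consumer of this markup might read the alternatives, the sketch below parses a well-formed version of the example with Python's standard `xml.etree.ElementTree`. The function name and the returned structure are hypothetical and are not part of VHML or of the authors' tools.

```python
import xml.etree.ElementTree as ET

SPEAKER_XML = """
<speaker>
  <personname>Group1</personname>
  <colective>
    <personname>Character1</personname>
    <personname>Character2</personname>
    <personname>Character3</personname>
  </colective>
  <personname>Character4</personname>
  <person><p>Text...</p></person>
</speaker>
"""


def candidate_speakers(xml_text):
    """Return (alternatives, utterance) for one <speaker> element.

    Each alternative is a list of names: one name for a single
    character, several names for a <colective> group.
    """
    speaker = ET.fromstring(xml_text.strip())
    alternatives = []
    for child in speaker:
        if child.tag == "personname":
            alternatives.append([child.text.strip()])
        elif child.tag == "colective":
            alternatives.append(
                [name.text.strip() for name in child.findall("personname")]
            )
    utterance = "".join(speaker.find("person").itertext()).strip()
    return alternatives, utterance


alternatives, utterance = candidate_speakers(SPEAKER_XML)
```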



3.2 Pre-processing 

The text is processed by a shallow parsing module - SuSAna - that performs efficient 
analysis over unrestricted text. The module recognizes, not only the boundaries, but also 
the internal structure and syntactic category of syntactic constituents [3]. It is used to 
identify the nucleus of the noun phrases. 

A single noun phrase (SNP) is a noun phrase containing either a proper noun or an
article followed by a noun phrase.

We only considered as declarative the following verbs: acrescentar, acudir, adicionar,
afirmar, anunciar, aparecer, argumentar, atalhar, atirar, avisar, chamar, comunicar,
confessar, continuar, concluir, declarar, dizer, exclamar, explicar, expor, fazer,
gritar, interromper, manifestar, meter, noticiar, observar, ordenar, pensar, perguntar,
publicitar, redarguir, repetir, replicar, resmungar, responder, retorquir, rosnar, ser.

The system knows the characters that are referred to in each story, which are listed
in an XML file with the following format:

<characters>
  <newcharacter>
    <name>Character name</name>
    <gender>male, female or neutral</gender>
    <cardinality>singular or plural</cardinality>
    <alternativename>
      <name>Alternative name 1</name>
      <name>Alternative name 2</name>
      <name>Alternative name 3</name>
    </alternativename>
  </newcharacter>

</characters>

3.3 Solution 

From the hand inspection of the training set (the first eleven stories of Table 1), we
extracted five rules. They are not called heuristics, to avoid confusion with the
direct/indirect discourse heuristics already presented.

Rule 1 If the first sentence of the indirect discourse (immediately following a direct
discourse) contains a declarative verb (3rd person) that appears before the first SNP,
and the SNP is a valid character name, then that character is responsible for the
previous direct discourse utterance.

Example, from the corpora:

- Mas não vão conseguir - disse a Ana, inesperadamente.

Rule 2 For a direct discourse, if in any sentence belonging to a previous indirect
discourse (the search is performed from the direct discourse to the beginning of the
document) the word that precedes an SN containing a character name is a declarative
verb in the 3rd person, then the noun of that SN refers to the character responsible
for the direct discourse.

Example:

E mais uma vez apareceu o lobo mau:

- TOC! TOC! TOC!

Rule 3 For a direct discourse, if any sentence belonging to a previous direct discourse
(the search is performed from the direct discourse to the beginning of the document)
starts with an SN containing a character name, then the noun of that SN refers to the
character responsible for the present direct discourse.

Example:

E eis que a menina bate à porta...

— Avó, sou eu. Tenho aqui uma pequena prenda para si...

Rule 4 If the direct discourse itself contains (the search is performed from the beginning
to the end of the direct discourse) either an SN containing a character name preceded by
a verb in the present tense, 1st or 3rd person, or an SN containing a character name
immediately preceded by the verb “chamar” in the present tense, reflexive 1st person,
then the noun of that SN refers to the character responsible for the present direct discourse.

Example:

Uma rapariga loira e bonita apresentou-se:

— Sou a Sofia e vou tratar da sua viagem. Que filme é que quer ver?
Rule 5 For a direct discourse, if any sentence belonging to a previous direct discourse
(the search is performed from the direct discourse to the beginning of the document) has
an SN containing a character name that is not preceded by a declarative verb, or is
immediately followed by a punctuation mark, then the noun of that SN refers to the
character responsible for the present direct discourse.

Example:

- Até que enfim que te encontro, Anita! Procurei-te por todo o lado. Tenho uma
carta para ti.

— Uma carta? Para mim? 
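
To make the shape of these rules concrete, here is a deliberately simplified sketch of Rule 1. It replaces the shallow parser and morphological analysis with plain token matching, so it is an illustration under stated assumptions rather than the authors' implementation: the verb forms, the article list and the function name are all hypothetical.

```python
import re

# Hypothetical 3rd-person forms of a few of the declarative verbs listed earlier.
DECLARATIVE_3RD = {"disse", "perguntou", "respondeu", "gritou", "exclamou"}
# Portuguese definite articles that may introduce the noun phrase.
ARTICLES = {"o", "a", "os", "as"}


def rule1(narrator_sentence, characters):
    """Return the character named right after a declarative verb, or None.

    narrator_sentence: first sentence of the indirect discourse that
    follows a direct discourse.
    characters: set of lowercase character names known for the story.
    """
    tokens = re.findall(r"\w+", narrator_sentence.lower())
    for i, token in enumerate(tokens):
        if token in DECLARATIVE_3RD:
            rest = tokens[i + 1:]
            # Skip an optional article to reach the nucleus of the noun phrase.
            if rest and rest[0] in ARTICLES:
                rest = rest[1:]
            if rest and rest[0] in characters:
                return rest[0]
            return None
    return None
```

Applied to the narrator fragment of the Rule 1 example, `rule1("disse a Ana, inesperadamente.", {"ana"})` selects `ana` as the speaker of the preceding direct discourse.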



3.4 Evaluation 

The evaluation process compares the noun included in the selected NP with the names 
described in the file that contains the enumeration of all story characters. If the noun is 
either included (a string operation) in the correct name of the character, or is included in 
any of the alternate names, we consider it a correct identification. Sometimes, it may lead 
to a “soft” evaluation, since the name porquinho is considered a correct identification of 
any of the following characters: porquinho mais velho, porquinho do meio, and porquinho 
mais novo. 
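
The inclusion test described above can be sketched as a case-insensitive substring check. The function and parameter names are hypothetical; the example reproduces the “soft” behaviour noted for porquinho.

```python
def is_correct_identification(noun, name, alternative_names=()):
    """True if the noun is contained in the character name or in any alternative name."""
    noun = noun.lower()
    return any(noun in candidate.lower() for candidate in (name, *alternative_names))


PIGS = ["porquinho mais velho", "porquinho do meio", "porquinho mais novo"]
# The same noun is accepted for all three characters, hence the "soft" evaluation.
soft_matches = [is_correct_identification("porquinho", pig) for pig in PIGS]
```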

The first step was the independent evaluation of each rule, see Table 3. Then we 
trained a decision tree (CART) to identify the rule that is able to predict the character 
responsible for each direct utterance. To train the decision tree, we used as features: 

1 parameter containing the rule with the correct result; 

5 parameters containing the information about the use of each rule: 0 means the rule 
did not trigger, 1 that the rule did trigger; 

10 parameters containing the agreement between rules: 0 means that at least one of the 
rules did not trigger, 1 the rules agree, and 2 the rules did not agree. 
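
The 15 input features can be sketched as follows, assuming that the 10 agreement parameters are the pairwise comparisons of the five rules (5 choose 2 = 10). That pairing is an inference from the counts given above, and the function name and input representation are hypothetical.

```python
from itertools import combinations


def rule_features(predictions):
    """Build the 5 trigger flags and 10 pairwise agreement codes.

    predictions: list of five entries, one per rule; each entry is the
    character name proposed by the rule, or None if it did not trigger.
    """
    triggered = [0 if p is None else 1 for p in predictions]
    agreement = []
    for a, b in combinations(predictions, 2):
        if a is None or b is None:
            agreement.append(0)  # at least one of the two rules did not trigger
        elif a == b:
            agreement.append(1)  # both triggered and agree
        else:
            agreement.append(2)  # both triggered and disagree
    return triggered + agreement


features = rule_features(["Ana", None, "Ana", "Sofia", None])
```

The target label (the rule with the correct result) is kept separate from these features when training the tree.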

The performance of the decision tree after training is shown in Table 4:
84.8% of correct answers on the training corpus (145 correct answers out of 171), and
65.7% of correct answers on the test corpus (23 correct out of 35). Table 5 contains the
confusion matrix after the training.
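
The precision and recall figures of Table 4 can be reproduced from these counts if recall is taken over all direct discourses in each corpus (423 for training and 216 for test, the DD sums of Table 3). That denominator is an inference from the numbers, not a definition given in the text; the function name is hypothetical.

```python
def precision_recall(correct, answered, total_direct):
    """Return (precision, recall) in percent.

    correct: utterances attributed to the right character
    answered: utterances for which the tree had a rule to choose from
    total_direct: all direct discourses in the corpus (assumed denominator)
    """
    return 100.0 * correct / answered, 100.0 * correct / total_direct


train_precision, train_recall = precision_recall(145, 171, 423)
test_precision, test_recall = precision_recall(23, 35, 216)
```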

3.5 Discussion 

First of all, the performance achieved is very good, considering that these are the first
results, and there is room for improving every stage of the processing chain. One can
also conclude that both corpora are small, and the test corpus is not ???????, since it
does not contain elements of class 4 (R4 should be applied) and class 5 (R5 should be
applied).

The rules, when they trigger, produce good results, but either they should
trigger more often (68% of direct discourses are not associated with any character), or
other rules are missing. R4 and R5 should be revised, since they trigger only about 1%
of the time.

After a more detailed evaluation of the errors in both corpora we concluded that:
(i) the part-of-speech tagger is the culprit only twice; (ii) the incorrect identification of
Table 3. Results of DID measured by DID-Verify

                 Rule 1        Rule 2        Rule 3        Rule 4        Rule 5
Story    DD    App   Succ    App   Succ    App   Succ    App   Succ    App   Succ
          #      %      %      %      %      %      %      %      %      %      %
1         7   42.8    100   42.8    100   14.3    100      0      -      0      -
2        28   14.3    100   14.3      0   32.1   88.9      0      -      0      -
3        20      5      -      0      -     30   66.7      0      -      0      -
4        15    6.6    100   13.3    100     20   66.7     30    100      0      -
5        91    4.4    100    6.6     50   18.7   52.9    2.2    100    3.3    100
6        13    7.7    100   15.4     50   30.8     75      0      -      0      -
7        36      0      -      0      -    2.8    100      0      -    0.0    100
8        66    7.5    100    9.1   66.7   22.7     40    1.5    100    1.5    100
9        76     46    100   48.7   75.7    3.9   66.7      0      -    1.3    100
10        9      0      -      0      -   22.2     50      0      -      0      -
11       62    6.5    100    8.1     40    8.1     40      0      -    4.8    0.0
Sum     423   13.7   98.3   15.6   65.2   15.6   59.1    1.4    100    2.1   66.7
12       15   13.3    100     20   33.3      0      -      0      -      0      -
13      104    2.9    100    4.8     80      1    100      0      -      0      -
14       38    5.3    100    5.3     50    7.9      -      0      -      0      -
15       59    6.8     50    6.8     50   23.7   57.1      0      -      0      -
Sum     216    5.1   81.8    6.5   57.1    8.3     50      0      -      0      -

Table 4. Performance of the Decision Tree on Identifying the Character of each Direct Utterance

                  Recall  Precision
Training Corpus    34.3%      84.8%
Test Corpus        10.6%      65.7%


Table 5. Confusion Matrix of the Decision Tree on Identifying the Character of each Direct
Utterance

Training corpus
         -    R1    R2    R3    R4    R5   Sum    Succ
-        0     1    11    11     0     2    25      0%
R1       0    57     0     0     0     0    57    100%
R2       0     0    42     0     0     0    42    100%
R3       0     0     1    36     0     0    37   97.3%
R4       0     0     0     0     5     0     5    100%
R5       0     0     0     0     0     5     5    100%
Sum      0    58    54    47     5     7   171   83.3%

Test corpus
         -    R1    R2    R3    R4    R5   Sum    Succ
-        0     0     5     6     0     0    11      0%
R1       0     9     0     0     0     0     9    100%
R2       0     0     7     0     0     0     7    100%
R3       0     1     0     7     0     0     8   87.5%
R4       0     0     0     0     0     0     0    100%
R5       0     0     0     0     0     0     0    100%
Sum      0    10    12    13     0     0    35   68.6%


direct/indirect discourse is responsible for 13 mistakes (R1: 1, R2: 2, R3: 7 and R5: 3);
(iii) the shallow parser did not identify an NP and, consequently, R3 failed to identify the
correct character 4 times; and (iv) the decision tree made an incorrect choice 24 times.
4 Future Work

We have made good progress on children's story interpretation/marking but, since it is a
complex and heterogeneous problem, a lot of work remains to be done, namely:

- to increase the size and variety of both corpora;

- to define associations of words and expressions to help identify some types of story
characters;

- to use a morphosyntactic disambiguator to handle ambiguous word classifications;

- to improve the automatic segmentation of text into sentences;

- to define a set of verbs that cannot be expressed by a narrator;

- to identify relations (ownership and creation) between objects and characters;

- to identify the family relations between characters;

- to introduce in the processing chain a new module to identify the characters taking
part in a story, which are, for the moment, given by the user;

- to use anaphora resolution to help identify characters referred to by pronouns;

- to implement and evaluate a new rule: whenever two consecutive direct utterances
are associated with different characters, consider that, until the next indirect utterance,
the identified characters alternate in the enclosed direct discourses;

- to include prepositional and adjectival attachments in the noun phrases, to enable a
better identification of the story characters;

- to identify the subject and complements of each sentence, which will enable a better
redesign/performance of rules R2 and R3;

- to develop new modules to identify gestures and emotions as well as the environment
where each scene takes place.
Abstract. This paper compares two approaches to a multilayered Question
Answering (QA) architecture suitable for enhancing current QA capabilities
with the possibility of processing complex questions, that is,
questions whose answer needs to be gathered from pieces of factual
information scattered across different documents. Specifically, we have
designed a layer oriented to process the different types of temporal
questions. In the first approach, complex temporal questions are decomposed
into simple questions, according to the temporal relations expressed in
the original question. In the same way, the answers to each resulting
simple question are recomposed, fulfilling the temporal restrictions of
the original complex question. In the second approach, temporal
information is added to the sub-questions before they are processed. Evaluation
results show that the second approach outperforms the first one by 30%.



1 Introduction 

Question Answering can be defined as the answering by computers of precise or
arbitrary questions formulated by users. There are different types of questions
that a QA system may be asked to answer. Current QA systems are mainly focused on the
treatment of factual questions, but these systems are not so efficient if the questions
are complex, that is, questions composed of more than one event interrelated
by temporal signals such as after, before, etc. The task of answering this type of
question is called Temporal Question Answering.

Temporal QA is not a trivial task due to the complexity temporal questions
can reach, and it is useful not only when dealing with complex questions, but
also when the questions contain any kind of temporal expression that needs to
be resolved before the question can be answered. For example, for the question
“Who was president of Spain two years ago?”, it is necessary to resolve the
temporal expression “two years ago” first and then use that information in order
to answer the question with a current QA system. The system described in
Breck et al. [1] deserves mention as the only one that also uses implicit temporal
expression recognition for QA purposes, by applying the temporal tagger
developed by Mani and Wilson [2]. However, questions referring to the temporal
properties of the entities being questioned and the relative ordering of events
mentioned in the questions are beyond the scope of current QA systems:

— “Who was spokesman of the Soviet Embassy in Baghdad the invasion 

of Kuwait?” 

This work presents a QA system that is able to answer complex temporal
questions. This proposal tries to imitate human behavior when solving this
type of question. For example, a human that wants to answer the question:

would follow this process: 

1. First, he would decompose the example complex question into two simple 

ones: . and 

2. He would look for all the possible answers to the first simple question: 

3. After that, he would look for the answer to the second simple question: 

4. Finally, he would give as final answer one of the answers for the first question 

(if there is any), whose associated date is included within the period of 
dates corresponding to the answer of the second question. That is, he would 
obtain the final answer by recomposing the respective answers to each simple 
question through the temporal signal in the original question ( ). 

Or this process: 

1. First, he would decompose the example complex question into two simple 

ones: . and 

2. He would look for the answer to the second simple question which is asking 
for a date: 

3. After that, first question could be rewritten as: 

4. Finally, he would give as final answer one of the answers for this question (if
there is any), but in this case the search for the answer is more precise,
because only the answers of the first question included in that period of time
are obtained. ( , , is part of the first question now).

Therefore, the treatment of complex questions is based on the decomposition
of these questions into simple ones that can be resolved using conventional
QA systems. Answers to simple questions are used to build the answer
to the original question. Our proposal to solve the problem is based on a
multilayered architecture that allows the processing of questions with temporal
features. In addition, two proposals for the resolution of this problem have been
developed with the objective of determining which one is better.
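
The recomposition idea of the first process can be sketched as a filter over dated candidate answers. The sketch below is illustrative only: the class, the function name and the exact boundary conventions for each temporal signal are assumptions, and the candidate data are the dated answers quoted in the example of Section 3.1.

```python
from dataclasses import dataclass


@dataclass
class DatedAnswer:
    text: str
    start: int  # first year associated with the answer
    end: int    # last year associated with the answer


def recompose(candidates, reference_year, signal):
    """Keep the answers to the first sub-question that satisfy the temporal
    signal with respect to the date answering the second sub-question."""
    if signal == "before":
        return [a.text for a in candidates if a.end <= reference_year]
    if signal == "after":
        return [a.text for a in candidates if a.start > reference_year]
    if signal == "during":
        return [a.text for a in candidates if a.start <= reference_year <= a.end]
    raise ValueError(f"unsupported temporal signal: {signal}")


CANDIDATES = [
    DatedAnswer("Georgetown University", 1964, 1968),
    DatedAnswer("Oxford University", 1968, 1970),
    DatedAnswer("Yale Law School", 1970, 1973),
]

before_1968 = recompose(CANDIDATES, 1968, "before")
after_1968 = recompose(CANDIDATES, 1968, "after")
```

In the second process the reference date would instead be added to the first sub-question before it is sent to the QA system, so that filtering happens during retrieval rather than afterwards.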

The paper has been structured in the following way: first of all, section 2 
presents our proposal of a taxonomy for temporal questions. Section 3 describes 
the two approaches to our general architecture of a temporal QA system. Finally, 
in section 4, the evaluation of the system and some conclusions are shown. 



2 Proposal of a Temporal Questions Taxonomy 

Before explaining how to answer temporal questions, it is necessary to classify
them, since the way to solve them will be different in each case. Our classification
distinguishes first between simple questions and complex questions and was
presented in detail in Saquete et al. [6]. We will consider as simple those questions
that can be solved directly by a current General Purpose QA system, since they
are formed by a single event. On the other hand, we will consider as complex
those questions that are formed by more than one event related by a temporal
signal which establishes an order relation between these events.

Simple Temporal Questions:

They are resolved by a QA System directly without pre or postprocessing of the
question. Example:

. The temporal expressions need to be recognized, resolved and annotated. Example:

Complex Temporal Questions:

. Questions that contain two or more events, related by a temporal signal, and a
temporal expression. Example:

. Questions that consist of two or more events, related by a temporal signal. Example:



3 Multilayered Question-Answering System Architecture 

Current QA system architectures do not allow complex questions to be processed,
that is, questions whose answer needs to be gathered from pieces of factual
information scattered in a document or across different documents. In
order to be able to process these complex questions, we propose a multilayered
architecture. This architecture [5] increases the functionality of current QA
systems, allowing us to solve any type of temporal question. Complex questions
have in common the need for additional processing of the question in order
to be solved. Our proposal to deal with these types of more complex questions
is to superpose an additional processing layer, one for each type, on a current
General Purpose QA system, as shown in Figure 1. This layer will perform
the following steps:

— the decomposition of the question into simple events to generate simple ques- 
tions or sub-questions (with or without adding extra temporal information 
to the question) and the ordering of the sub-questions, 

— sending simple questions to a current General Purpose QA system, 

— receiving the answers to the simple questions from the current General Pur- 
pose QA system, 

— the filtering and comparison between sub-answers to build the final complex 
answer. 




Fig. 1. Multilayered Architecture of a QA 



Next, we present two approaches of how a layer is able to process temporal 
questions according to the taxonomy shown in section 2. The second approach 
is based on the first one, but tries to resolve some problems observed in the 
evaluation. 

3.1 First Approach of Architecture of a Temporal Question 
Answering System 

The main components of the first approach of a Temporal Question Answering
System are (see Figure 2): Question Decomposition Unit, General Purpose QA
system and Answer Recomposition Unit.

These components work together in order to obtain a final answer. The
Question Decomposition Unit and the Answer Recomposition Unit are the units
that make up the Temporal QA layer, which processes the temporal questions
before and after using a General Purpose QA system.

Fig. 2. Temporal Question Answering System (1st approach)

- Question Decomposition Unit is a preprocessing unit which performs
three main tasks. First of all, the recognition and resolution of temporal
expressions in the question is done. Secondly, regarding the taxonomy of the
questions shown in section 2, there are different types of questions and every
type has to be treated in a different way from the others. For this reason,
type identification needs to be done. After that, complex questions, that is,
those of Type 3 and 4, are split into simple ones. These simple questions are the input
of a General Purpose Question-Answering system. For example, the question

, is divided 

into two sub-questions that are related through the temporal signal : 

• Q1: Where did Bill Clinton study?

• Q2: When did go to Oxford University? 

- General Purpose QA system. Simple factual questions
generated are processed by a General Purpose QA system. Any QA system
could be used here. We have used a general QA system that is available on
the Internet: the IO Question Answering system¹. The only condition is to
know the kind of output returned by the system in order to adapt the layer
interface. For the example above, a current QA system returns the following
answers:

• Q1 Answers: Georgetown University (1964-68) // Oxford University
(1968-70) // Yale Law School (1970-73)

• Q2 Answer: 1968 

- Answer Recomposition Unit is the last stage in the process. The complex
questions were divided by the Decomposition Unit into simple ones



http://www.ionaut.com:8400/ 
with successful results of precision and recall. These simple questions were
processed by a QA system which returns a set of answers and the passages of the
documents where each answer is contained. The Answer Recomposition Unit
needs to preprocess this information in order to relate each answer
to a date. This date is obtained from the set of passages related to
the answer, because it is needed by the Individual Answer Filtering unit
to filter the answers. In this unit, the temporal constraints
imposed by the Temporal Expressions of the question are applied to all the
answers and some wrong answers are rejected. Finally, using the ordering
key imposed by the Temporal Signal of the complex question, the single
answers are ordered and a final answer to the complex question is obtained.
This process is divided into three modules:

• Preprocessing of the QA system output: Because our system is independent
of the General Purpose QA system used to answer the questions, a
preprocessing module is necessary in order to format the answers into the
specific structure that the recomposition unit expects. This input is a
file with all the possible answers to the questions and the dates related
to these answers. To obtain these dates, the TERSEO system is applied to
the document passages where the answer is found. TERSEO recognizes,
annotates and resolves all the temporal expressions in the passages, so
that it is possible to obtain the date of occurrence of the event the
question asks about. The system looks for the event in the document and
obtains the date related to this event. Once the answer and the date of
the answer are obtained, the recomposition can be done.

• Individual Answer Filtering: All the possible answers given by the
General Purpose QA system are the input of the Individual Answer Filtering
module. For the sub-questions with a temporal expression, it selects only
those answers that satisfy the temporal constraints obtained as temporal
tags by the TE Recognition and Resolution Unit. Only the answers that
fulfill the constraints go to the Answer Comparison and Composition
module.

• Answer Comparison and Composition: Finally, once the answers are
filtered, the results for every sub-question are compared by the
Comparison and Composition Unit, using the signals and the ordering key
implied by these signals.

This unit has as input the set of individual answers and the temporal
tags and signals related to the question, information that is needed to
obtain the final answer. Temporal signals denote the relationship between
the dates of the events that they relate. Assuming that F1 is the date
related to the first event in the question and F2 is the date related to the
second event, the signal establishes a certain order between these events,
which is called the ordering key. Some examples of ordering keys are shown in
Table 1.
Table 1. Example of signals and ordering keys

SIGNAL            ORDERING KEY
After             F1 > F2
Before            F1 < F2
During            F2(begin) <= F1 <= F2(end)
From F2 to F3     F2 <= F1 <= F3
On / in           F1 = F2
While             F2(begin) <= F1 <= F2(end)
At the time of    F1 = F2
Since             F1 > F2
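
The ordering keys of Table 1 can be read as predicates over dates. The
following Python fragment is only a minimal sketch of that reading, not the
implementation used in the system: the names ORDERING_KEYS and filter_answers,
the treatment of F2 as an interval for every signal, and the candidate dates
are assumptions made for illustration.

```python
from datetime import date

# Each ordering key is a predicate over F1 (date of the first event) and the
# interval [begin, end] of the second event. For a point date, begin == end.
# Extending "After", "Before" and "On / in" to intervals is an assumption.
ORDERING_KEYS = {
    "after": lambda f1, begin, end: f1 > end,
    "since": lambda f1, begin, end: f1 > end,
    "before": lambda f1, begin, end: f1 < begin,
    "during": lambda f1, begin, end: begin <= f1 <= end,
    "while": lambda f1, begin, end: begin <= f1 <= end,
    "from-to": lambda f1, begin, end: begin <= f1 <= end,
    "on": lambda f1, begin, end: begin <= f1 <= end,
    "in": lambda f1, begin, end: begin <= f1 <= end,
    "at the time of": lambda f1, begin, end: begin <= f1 <= end,
}


def filter_answers(signal, answers, begin, end=None):
    """Keep the (answer, date) pairs whose date satisfies the ordering key."""
    if end is None:
        end = begin  # the second event is a single date
    key = ORDERING_KEYS[signal.lower()]
    return [(text, f1) for text, f1 in answers if key(f1, begin, end)]


# Hypothetical dated candidates for a first sub-question (dates are invented).
candidates = [
    ("Georgetown University", date(1964, 9, 1)),
    ("Oxford University", date(1968, 10, 1)),
    ("Yale Law School", date(1970, 9, 1)),
]

# The second sub-question is assumed to resolve to the whole year 1968.
kept = filter_answers("before", candidates, date(1968, 1, 1), date(1968, 12, 31))
print(kept)
```

Keeping the keys in a table separates the signal vocabulary from the filtering
loop, so a new signal only needs a new entry.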



3.2 Evaluation of the System with the First Approach 

For this evaluation, we chose the TERQAS question corpus [3, 4], which consists
of 124 complex temporal questions. This set of questions was split by the
Question Decomposition Unit into simple questions. The answers to these simple
questions were obtained by a General Purpose QA system and were recomposed by
the Recomposition Unit. The results for every approach have been compared
with the results obtained by a General Purpose QA system and can be classified
into three main groups:



Table 2. Evaluation of the system with the first approach

Questions       TOTAL   BETTER RESULTS   EQUAL RESULTS   WORSE RESULTS
Type 1             47   -                47 (100%)       -
Type 2             59   36 (61%)         23 (39%)        -
Type 3              3   -                3 (100%)        -
Type 4             15   6 (40%)          8 (54%)         1 (6%)
All Questions     124   42 (34%)         81 (65%)        1 (0.8%)



— The results are the same in both systems. That is because: 

• The QA system does not give back any answer for the question and
therefore the TQA system does not give back anything either. There are
47 questions of this kind, and the types most affected are Type 1 and
Type 2.

• The TQA system returns the same answers as the QA system. This is
exclusively the case of Type 1 questions, since our system does not do
any processing on those questions. There are 34 questions of this kind
in our set.

— Our system obtains better results than the QA system. There are four dif- 
ferent situations: 

• The TQA system does not give back any answer because, although the QA
system gives back a set of answers, none of them fulfills the temporal
constraints imposed by the question and therefore none of these answers is
correct. This is considered a success of our system. There are 12
questions of this kind, and 11 of them are questions of Type 2.

• The QA system does not give back any answer but, after splitting the
question into simple ones and later recomposing the answer, the TQA
system is able to answer the complex question. There is only 1 question
of this kind.

• The QA system returns wrong answers but, after filtering by the
Temporal Expressions and splitting the question, more temporal
information is obtained and the TQA system is able to answer the complex
question properly. There is only 1 question of this type.

• The QA system returns a set of answers without considering the temporal
information, and the TQA system is able to filter those answers and give
back just those that fulfill the temporal restrictions. In this case, the
TQA system answers better than the QA system. There are 28 questions of
this type, and they are questions of Type 2 and Type 4.

— Our system obtains worse results than the QA system:

• The QA system is able to answer but the TQA system is not. That is
because, when the complex question is divided into two simple ones, some
keywords end up only in the second sub-question and are not used when
asking the first one, although these keywords may be useful for finding
an answer. When the first sub-question is processed on its own, the
results are poor because the information given by the missing keyword is
lost. However, there is only 1 question with this problem.

The results of this study are shown in Table 2. In conclusion, our system
improves on a General Purpose QA system in 34% of the questions and works
worse in less than 1% of them.
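
The percentages quoted above follow from the "All Questions" row of Table 2.
As a quick check, assuming only the three counts reported there (the variable
names are illustrative):

```python
# Counts per group, copied from the "All Questions" row of Table 2.
counts = {"better": 42, "equal": 81, "worse": 1}

total = sum(counts.values())  # size of the question corpus

# Share of each group as a percentage of the whole corpus, one decimal place.
shares = {group: round(100 * n / total, 1) for group, n in counts.items()}

print(total, shares)
```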

3.3 Second Approach of Architecture of a Temporal Question 
Answering System 

After the evaluation of the first approach, it can be observed that when
splitting complex questions into two independent sub-questions, some
information necessary to answer the complex question can be lost. Instead of
simply dividing the question into two sub-questions, the search is far better
if the second sub-question, which refers to a date, is solved first and the
first sub-question is then transformed by adding the temporal data obtained
to it. Using this approach, questions that were not answered before for lack
of information can now be solved. In addition, this approach filters the
results more exhaustively than the first one, since more resolved temporal
information is available before the question is posed to a current QA system.

In order to evaluate this new solution, a new intermediate unit, which
transforms the first question using the answer to the second, is added to the
first approach. In addition, this unit is in charge of transforming the
possible implicit temporal expressions that can appear in the questions
(see Figure 3).




Fig. 3. Temporal Question Answering System (2nd approach) 



This unit is only useful when we are dealing with Type 3 or Type 4 questions,
which are complex questions that need to be divided in two in order to be
answered, and with Type 2 questions, which have a temporal expression in the
question. There are two possible transformations of the first question:

• Temporal expressions of the question are transformed into dates or
concrete ranges, and this transformed question goes through the General
Purpose QA system.

• The second sub-question, which is a When question, is answered by the
General Purpose QA system, and the answer to this second question,
which is a date or period, is added to the first question. The answer to
the transformed question is then obtained from the QA system.

• What happened to world oil prices after the Iraqi "annexation" of Kuwait?

  1. Type: 4, Temporal Signal: after
     Initial Decomposition:
     Q1: What happened to world oil prices?
     Q2: When did the Iraqi "annexation" of Kuwait occur?

  2. Answer Q2: [01/01/1991-12/31/1991]

  3. Transformation of Q1:
     What happened to world oil prices since December 31, 1991?

• Did Hussein meet with Saddam before he (Hussein) met with Bush?

  1. Type: 4, Temporal Signal: before
     Initial Decomposition:
     Q1: Did Hussein meet with Saddam?
     Q2: When did he (Hussein) met with Bush occur?

  2. Answer Q2: [01/15/2004]

  3. Transformation of Q1:
     Did Hussein meet with Saddam until January 15, 2004?

• Who was spokesman of the Soviet Embassy in Baghdad at the time of
  the invasion of Kuwait?

  1. Type: 4, Temporal Signal: at the time of
     Initial Decomposition:
     Q1: Who was spokesman of the Soviet Embassy in Baghdad?
     Q2: When did the invasion of Kuwait occur?

  2. Answer Q2: [01/01/1990-12/31/1990]

  3. Transformation of Q1:
     Who was spokesman of the Soviet Embassy in Baghdad between in
     1990?

• Who became governor of New Hampshire in 1949?

  1. Type: 2, Temporal Signal: -, Temporal Expression: in 1949

  2. Transformation of Q1:
     Who became governor of New Hampshire from January 01, 1949 to
     December 31, 1949?
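
The transformations in these examples can be sketched as a small function
that appends a temporal constraint to the first sub-question. This is an
illustrative reading of the examples only: the function names, the handling
of signals other than "after" and "before", and the assumption that any
original temporal expression has already been removed from the question are
not taken from the system described here.

```python
from datetime import date

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def format_date(d):
    """Render a date in the style used in the examples, e.g. January 01, 1949."""
    return f"{MONTHS[d.month - 1]} {d.day:02d}, {d.year}"


def transform_question(question, signal, begin, end=None):
    """Append the resolved date or range of the second event to the question.

    `question` is the first sub-question without its temporal expression.
    """
    if end is None:
        end = begin  # the second event resolved to a single date
    body = question.rstrip("?").rstrip()
    if signal == "after":
        return f"{body} since {format_date(end)}?"
    if signal == "before":
        return f"{body} until {format_date(begin)}?"
    # Assumed fallback for the remaining signals and for Type 2 questions:
    # restrict the question to the whole resolved range.
    return f"{body} from {format_date(begin)} to {format_date(end)}?"


print(transform_question("What happened to world oil prices?", "after",
                         date(1991, 1, 1), date(1991, 12, 31)))
print(transform_question("Who became governor of New Hampshire?", "in",
                         date(1949, 1, 1), date(1949, 12, 31)))
```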



3.4 Evaluation of the System with the Second Approach

In this evaluation, two kinds of results are shown: 

— On the one hand, the questions that have a smaller and more precise, but
still correct, set of answers than in the first approach.
Table 3. Evaluation of the system with the second approach (NQ = Number of
Questions and AA = Average of Answers)

                TOTAL   BEST RESULTS     BEST RESULTS     IMPROVEMENT (%)
                        1st APPROACH     2nd APPROACH
                        NQ      AA       NQ      AA       NQ(%)   AA(%)
Type 1             47   -       -        -       -        -       -
Type 2             59   36      2.8      9       1.6      25%     57%
Type 3              3   -       -        -       -        -       -
Type 4             15   6       6        4       2        67%     33%
All Questions     124   42      2.83     13      1.6      30%     56%



— On the other hand, we have calculated the average number of answers of this 
question corpus for the first approach and for the second with the objective 
of determining if this set is reduced. 

After evaluating this new approach and comparing the results with those of
the first one, the following is observed:

— There are the same number of questions in which the TQA system answered 
the same as the QA system in both approaches. 

— Nevertheless, 13 of those questions in which the TQA system was better 
than the QA system have been improved, because the new set of answers
for these questions is much more precise and better filtered. The set of
answers is smaller, but in every case the correct answer belongs to the new
set. Therefore, the possibility of obtaining an erroneous answer for those
questions is reduced with this new approach.

• There are 9 questions of Type 2 that have better answers with this
approach, because the temporal expression is resolved before asking the
General Purpose QA system. On average, there are 2.8 answers per question
with the first approach and 1.6 answers with the second approach, and
the correct answer is still among them.

• There are 4 questions of Type 4 that have better answers because the
answer to the second sub-question is used to reformulate the first
question before asking the QA system. That gives extra information to
the QA system, which is why the results are much better. Moreover,
on average, there are 6 answers per question in the first approach, and
this number is reduced to 2 in the second approach.

The results of this evaluation are shown in Table 3. The first column of this
table contains the total number of questions in the question corpus, divided
by type. The second column gives the number of questions that were better in
the first approach than in the QA system, and the average number of answers
per question. The third column shows the number of these questions that were
better still in the second approach than in the first one, and the average
number of answers per question. Finally, the fourth column presents the
percentage of improvement of the second approach compared with the first one.





102 E. Saquete et al. 



4 Conclusions 

This paper presents a new and intuitive method for answering complex temporal 
questions using an embedded factual-based QA system. This method is based on 
a proposal for the decomposition of temporal questions where complex questions 
are divided into simpler ones by means of the detection of temporal signals. The 
TERSEO system, a temporal information extraction system, has been used to 
detect and resolve temporal expressions in questions and answers. 

This work proposes a multilayered architecture that makes it possible to
solve complex questions by enhancing current QA capabilities. Two approaches
have been presented. The first one is based on the decomposition of complex
questions into simpler ones, and the second one, apart from the
decomposition, reformulates the sub-questions by adding temporal information
to them.

In conclusion, the evaluation shows an improvement of 34% for the first
approach of the TQA system compared with a current QA system. In addition,
the second approach of the system is 30% better than the first one. Moreover,
the number of answers for every question has been reduced by more than 50% in
the second approach, and after this reduction the correct answer was still in
the set.

In the future, our work will be directed at fine-tuning this system and
increasing its capabilities in order to be able to process more kinds of
complex questions.
Abstract. The objective of this paper is to show how text processing,
especially text generation, can make use of contextual features. For a
computer system to operate with everyday language, the context that surrounds
language use must be considered, and this is one of the reasons why we adopt
Systemic Functional Linguistics. In this paper we introduce a database for
text processing systems called “the Semiotic Base”, which contains “the
Context Base” as the model of context, and show how this system works in text
processing, especially in text planning and generation.



1 Introduction 

We have developed a computational model of language in context. As a basic theory, 
we adopt Systemic Functional Linguistics (SFL) [1] that aims at describing the sys- 
tem of language comprehensively and provides a unified way of modeling language 
in context. In this section we illustrate the conception of Everyday Language Comput- 
ing and SFL. 

1.1 Everyday Language Computing 

We propose a paradigm shift from information processing with numbers and
formal symbolic logic to information processing with our everyday or
commonsense use of language. In this project of “Everyday Language
Computing”, the role that language plays in human intelligence is regarded as
important. On the basis of this conception, our project aims at realizing
language-based intelligent systems on computers in which human language is
embedded. In this project we have adopted SFL as the model of language to be
embedded in the system, have constructed the Semiotic Base, a database for
text processing systems based on SFL, and have developed systems that use
this database [2, 3, 4, 5, 6].

1.2 Systemic Functional Linguistics

SFL is regarded as a functionalist approach to linguistics. According to
Matthiessen [7], SFL regards language as a resource for communication rather
than a set of rules. As such, the main characteristics of linguistic text are
the following: stratification, metafunction and instantiation.

A text can be stratified into four strata: context, meaning, wording (lexicogrammar: 
grammar and lexis) and expression. Context is realized into the meaning of the text, 
the meaning is realized into the wording of the text and the wording is realized into 
the expression of the text. 

According to Halliday, language has three metafunctions: ideational,
interpersonal and textual. The ideational metafunction corresponds to the
functions that constitute the events expressed in language and show the
logical conjunction between them. The interpersonal metafunction corresponds
to the functions that show the social relationship between the interactants.
The textual metafunction corresponds to the functions that make the text
coherent and appropriate to the context in which it is produced. In addition,
context is divided into three values: Field (the social activity done in the
interaction), Tenor (the social relationship between the interactants) and
Mode (the nature of the media used in the interaction). They roughly
correspond to the ideational, interpersonal and textual metafunctions
respectively.

A language can be seen as a cline of instantiation. A text is an instance of
language in context at one extreme of the cline. Language as a whole system is
the potential at the other. A set of texts using the same linguistic resources
constitutes a text type. The resources used in the texts can be regarded as
the register of the text type. We can place text type and register somewhere
in the middle of the cline.

1.3 The Semiotic Base 

In this section we introduce the overall contents and structure of the Semiotic Base, a 
computational model based on SFL. 

The main components of the Semiotic Base are the Context Base, the Meaning 
Base, the Wording Base and the Expression Base. Each database describes contextual 
or linguistic features to characterize the system of language and relations among the 
features within a base as well as those between other bases. 

The Context Base characterizes context by Field, Tenor and Mode. Once these
variables are determined, the context is specified as instances of Situation
Types. Further, the way of developing a text in each Situation Type is
described as a Generic Structure. In addition to the main bases, the Semiotic
Base includes the Concept Repository, the General Dictionary and the
Situation-Specific Dictionary.



2 Contents of the Context Base 

In this section we show the contents, structure and data structure of the
Context Base.

We have built a prototype system equipped with the Semiotic Base, including
the Context Base. This system is designed so that ordinary people can operate
a computer with their everyday use of language. As users, they can talk with
a client secretary in the computer and create a document such as an
invitation card, or manage their e-mail, by asking the secretary to do so.

2.1 Contents of the Context Base 

The Context Base consists of the following contents: databases of Field, Tenor and 
Mode, a database of situation types with generic structure. Further, it is supplemented 
with the Machine Readable Dictionaries (the Concept Repository and the Situation 
Specific Dictionary), the Knowledge Base and the Corpus Base as shown in Fig. 1. 




[Figure 1 components: Linguistic Resources (Meaning Base, Wording Base);
Conceptual Repository; Verbalized Knowledge (Knowledge for Describing
Situation, Knowledge in Situation); Text Corpus (tagged with the SB
resource); Machine Readable Dictionary]

Fig. 1. Contents of the Context Base

The values of Field, Tenor and Mode characterize a situation type. We made a pro- 
totype model of context with these situation types. 

Situation types are divided into two kinds: the primary situation type and
the secondary situation type*. In our prototype system we must consider two
types of situation. One is the situation in which a computer user and a
client secretary are talking to each other in order to create a document with
the word processor. This situation is realized as a dialogue between the two
interactants. We call it the primary situation type. The other is a
sub-situation that appears in the document being created. In this situation,
the user invites people to a party, sends summer greetings to people, and so
on. This type of sub-situation appears within the primary situation type. We
call it the secondary situation type. Both situation types are tagged with
the values of Field, Tenor and Mode.

In the prototype system we adopt the viewpoint that the document realizing
the secondary situation type (such as a summer greeting card) is produced
through the dialogue between the user and the secretary that realizes the
primary situation type. This is why we embed the secondary situation type in
the primary situation type.



* Some systemic functional linguists say that a context of situation can be
divided into two: first-order and second-order ones [7]. This conception
seems similar to ours. However, these two kinds of context reflect different
aspects of the same context, and both of them are realized in the same text.
Our primary and secondary situation types, on the other hand, imply different
contexts and are realized in different texts: a dialogue between the
interactants and a document to be created through the dialogue, respectively.

Generic structure describes the development of the text within the given
context. According to SFL, a text or an interaction develops towards its goal
through steps called “stages”. Generic structure therefore consists of stages,
together with their transitions, which are either necessary or optional for
the text in the given context. Each stage further contains moves in the
interaction and the information on the transitions between moves. Each move
carries the information on the possible speech function of the interactants’
utterance.

Corresponding to the two situation types, the Context Base accommodates
the primary generic structure and the secondary generic structure. The primary
generic structure describes the development of the dialogue between the computer user
and the client secretary. So it consists of such stages as “greetings”, “task- 
identification”, “task-performance” etc. The secondary generic structure describes the 
process of creating the document according to the structure of the document. So it 
consists of such stages as “receiver-identification”, “task-confirmation”, “body- 
editing”, “picture-inserting” etc. These stages are embedded in a stage of primary ge- 
neric structure. 

The Context Base is supplemented with the Concept Repository, the Situation- 
Specific Dictionary, the Knowledge Base and the Corpus Base. The Concept Reposi- 
tory contains all the concepts which are used in a given context. Situation Specific 
Dictionary explains the linguistic features of the concepts registered in the Concept 
Repository. The Knowledge Base describes in natural language the knowledge used 
in the given context. It is analyzed in advance with the resource of the Semiotic Base, 
i.e. it is annotated with the contextual, linguistic (semantic and lexicogrammatical) 
and conceptual features. The Corpus Base is annotated with the resource of the Semi- 
otic Base just as the Knowledge Base. The difference between the Knowledge Base 
and the Corpus Base is that while the Knowledge Base has the standard form, the 
Corpus Base doesn’t. 

2.2 Data Structure 

The data of a situation type consists of the following: the ID of the
situation, the features of Field, Tenor and Mode, the upper situation types,
the embedded situation type, the list of the Knowledge for Describing
Situation (KDS), the Concept Repository of the situation type and the generic
structure.

The upper situation type lists situation types in order to show the position
of the given situation type. The embedded situation type shows which
situation type is embedded in the given situation type. It implies that the
embedded situation type is the secondary situation type while the embedding
situation type is the primary situation type in a given interaction.

The list of the KDS shows what kind of knowledge is deployed in the given
situation type. Each KDS is given a unique ID and is retrieved by that ID
when needed.

Generic structure consists of the ID of the generic structure and the list of
the possible stages. A stage consists of the ID of the stage and the list of
possible moves in it. A move consists of the ID of the move, the polarity of
the move, the structure of the move and the list of the moves to which a
transition is possible.

The move structure is represented as a conceptual frame of speech function
and speech content. The speech function implies the interpersonal meaning the
speaker conveys to the hearer and roughly corresponds to the “illocutionary
act” in Speech Act Theory. The speech function concept has the slots of
speaker, hearer and content, and the value of the content slot is the concept
of the speech content, which implies the ideational meaning of the given move.
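
The data structure described in this section can be pictured as nested
records. The following Python sketch is one possible rendering under stated
assumptions: all class names, attribute names and example values are
hypothetical and are not taken from the Context Base itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SpeechFunction:
    """Conceptual frame of a move: speech function with its three slots."""
    name: str
    speaker: str
    hearer: str
    content: str  # concept of the speech content (ideational meaning)


@dataclass
class Move:
    move_id: str
    polarity: str
    structure: SpeechFunction
    next_moves: List[str] = field(default_factory=list)  # IDs of reachable moves


@dataclass
class Stage:
    stage_id: str
    moves: List[Move] = field(default_factory=list)


@dataclass
class GenericStructure:
    structure_id: str
    stages: List[Stage] = field(default_factory=list)


@dataclass
class SituationType:
    situation_id: str
    field_value: str  # Field
    tenor: str
    mode: str
    upper_situation_types: List[str] = field(default_factory=list)
    embedded_situation_type: Optional["SituationType"] = None
    kds_ids: List[str] = field(default_factory=list)
    concept_repository: List[str] = field(default_factory=list)
    generic_structure: Optional[GenericStructure] = None


# Hypothetical instance: a secondary situation embedded in a primary one.
secondary = SituationType("summer-greeting", "greeting", "user-addressee", "written")
ask = SpeechFunction("ask-proposition", "secretary", "user", "changing")
stage = Stage("receiver-identification", [Move("move1", "positive", ask, ["move2"])])
primary = SituationType(
    "document-creation", "document editing", "user-secretary", "spoken dialogue",
    embedded_situation_type=secondary,
    kds_ids=["kds-001"],
    generic_structure=GenericStructure("gs-primary", [stage]),
)
print(primary.embedded_situation_type.situation_id)
```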



3 Text Generation 

We have made a computational model of text processing using the Semiotic Base. We 
focus on text generation hereafter. For the algorithm of text understanding, see [5]. 

It is often the case that we cannot continue a conversation without
contextual knowledge. The following is a dialogue between a secretary (A) and
a client (B), in which the client, a computer user, asks the secretary agent
to create a document. In this case A must take the contextual features into
account so that A can ask B in utterance 3 whether or not s/he wishes to
change the passage according to the addressee of the greeting card. To do so,
A must know the nature of both B’s friends and colleagues: the social
hierarchy and distance between B and his/her friends or colleagues.

(1) A: syotyuu-mimai-o donata-ni okuri-masu-ka? 

summer-greeting-card-Acc. who-Dat. send-polite-question 
Who do you send the greeting cards? 

(2) B: yuujinn-to kaisya-no dooryoo desu. 

friends-and company-Gen colleague is 
To the friends and colleagues of my company. 

(3) A: syotyuu-mimai-no bunmen-o kae-masu-ka? 

summer-greeting-card-Gen passage-Acc change-polite-question 
Do you change the passage of the card? 

Linguistically generation is the inverse process of understanding. In generation we 
transform the linguistic system or potential into the linguistic instance. Selecting some 
conceptual features characterizing what to say, we realize these features as a text us- 
ing linguistic resources. 



3.1 Text Planning: Determining What to Say 

Text planning is the process of deciding the content which is to be generated.
At this step, the algorithm begins by checking the conceptual analysis of the
user’s utterance. Then the current move is checked with reference to the
Stage Base and the values of context are investigated. Next, the candidates
for the content to be generated are picked up from the Stage Base and the
most appropriate one is chosen.

The content to be generated is given by an instance conceptual frame selected
in the Stage Base. Taking the above utterance (3) as an example, we will show
the process of planning the content as a speech content and generating the
text string. Fig. 2 shows the stage “receiver-identification” with four moves
concerning the dialogue between a secretary and a client.

[Figure 2 labels: system’s ask-proposition; system’s ask-proposition 2]

Fig. 2. Moves and Their Transition in a Stage: the Case of “Receiver-Identification”



In this figure, moves 1 and 2 correspond to utterances 1 and 2, respectively.
Each move of the Stage Base that concerns a system utterance contains a
prototype of a conceptual frame expressing a speech content. Suppose that we
are at move 2. Utterance 2 of the user at move 2 is found, by the conceptual
analysis in the understanding model, to have the conceptual frame shown in
Fig. 3.




[Figure 3 labels: inform-value; hearer; content; secretary; addressee]

Fig. 3. Conceptual Frame of Utterance 2



Now we go to the next move. If there are several move candidates after move 2,
we determine which move is the most appropriate. In this case, there are two
candidates: move 3 and move 4. Fig. 4 shows the two different speech contents
for the secretary corresponding to move 3 and move 4.

The next move is specified in the Stage Base, and the way of determining the
next move differs according to the stage or move being dealt with. The speech
contents of moves 3 and 4 can be read as “Do you change the passage of the
summer greetings card?” and “Is it OK that the passages of the cards are the
same?” respectively.

In order to determine the speech content for utterance 3, we use the
contextual features concerning Tenor and the situational knowledge in the
Context Base. According to the conceptual frame of utterance 2, there are two
tenors as addressees, friends and colleagues, and the difference between them
in social hierarchy and distance is found to be rather large. The situational
knowledge concerning the content or style of a card says that the content may
differ according to the social distance of the addressees. As a result, the
speech content of move 3 is selected to generate utterance 3.

Fig. 4. Conceptual Frames of Utterances 3 and 4

In this case the move candidate is determined according to the difference of Tenor 
between the user and the addressee. Here we search for the KDS. The KDS used is as 
follows: 

<colleague> is <superior> to <user> in terms of <social hierarchy> 

<friend> is <equal> to <user> in terms of <social hierarchy> 

<colleague> is <distant> to <user> in terms of <social distance> 

<friend> is <close> to <user> in terms of <social distance> 

From these KDSs we determine the next move by identifying the TenorValue, 
which shows the difference of Tenor between the friends and the colleagues with the 
following rules: 

If the TenorValue is <large>, then set the CurrentMove to move3. 

If the TenorValue is <medium>, then set the CurrentMove to move4. 

If the TenorValue is <small>, then set the CurrentMove to move4. 

By these rules and the identified TenorValue we determine the next move. In
this case TenorValue is decided as <large> and move 3 is selected as the next
move.
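
The rule-based selection above can be sketched in a few lines of Python. The
KDS entries and the three rules are taken from the text; how TenorValue is
computed from the KDS is not specified there, so the counting scheme in
tenor_value (number of dimensions on which the two addressees differ) is
purely an assumption for illustration, as are all identifiers.

```python
# KDS facts from the text: (addressee, dimension) -> relation to the user.
KDS = {
    ("colleague", "social hierarchy"): "superior",
    ("friend", "social hierarchy"): "equal",
    ("colleague", "social distance"): "distant",
    ("friend", "social distance"): "close",
}

DIMENSIONS = ("social hierarchy", "social distance")

# The three rules from the text: TenorValue -> next move.
NEXT_MOVE = {"large": "move3", "medium": "move4", "small": "move4"}


def tenor_value(kds, first, second):
    """Assumed scheme: count the dimensions on which the addressees differ."""
    differing = sum(kds[(first, dim)] != kds[(second, dim)] for dim in DIMENSIONS)
    return {0: "small", 1: "medium", 2: "large"}[differing]


def next_move(kds, first, second):
    """Apply the rules to the TenorValue of the two addressees."""
    return NEXT_MOVE[tenor_value(kds, first, second)]


print(tenor_value(KDS, "friend", "colleague"), next_move(KDS, "friend", "colleague"))
```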

3.2 Text Generation: Determining How to Say 

After determining the content to be generated, we generate the text string
from the instance conceptual frame that was determined.

First, we search the Concept Repository to find a concept item corresponding
to the concept “changing” in the conceptual frame for utterance 3. From the
corresponding item, shown in Table 1, we obtain the semantic feature of the
Process “changing” and of its Participants and Circumstances. In terms of
SFL, a Process corresponds to a verb (for instance, doing something),
Participants are elements such as subject and object that take part in the
Process, and Circumstances are elements such as the means and location of the
Process. In this example, referring to the term “changing” in the Concept
Repository, we find the semantic feature “fg-contact” of the Process
“changing”, and we further obtain the SFL roles of the Participants: “Actor”
for agent and “Goal” for object. In this case there is no Circumstance with
this Process. Fig. 5 illustrates this process.
Table 1. Concept Repository 

Head concept identifier   changing 
Concept type              class 
EDR concept identifier    0e87a2 3ceec7 3cf4fc 3cf90d 
MB features               fg-contact 
WB features               mat-contact 
Upper Concept name        doing 

No.   Slot name   Slot value       SFL role 
1     agent       agent            Actor 
2     object      domain-concept   Goal 

Fig. 5. Correspondences of Concepts with Semantic Roles 

At this step we obtain the following: 

changing -> semantic feature of “changing” 
agent -> Actor 
object -> Goal 

Next, we obtain the lexicogrammatical feature corresponding to the semantic feature 
obtained at the last step, together with the case markers and the word order. We 
obtain the lexicogrammatical feature in the Wording Base from the realization 
statement attached to the corresponding semantic feature in the Meaning Base. Then we 
refer to the realization statement (the linguistic constraint attached to the feature 
in the database) in the Wording Base and look up the constraints on the case markers 
and the word order. 

In this example we find the lexicogrammatical feature “mat-contact” in the Word- 
ing Base from the realization statement attached to the semantic feature “fg-contact” 
in the Meaning Base. Then, referring to the realization statement corresponding to the 
obtained feature “mat-contact”, we can find the case markers and word order. Fig. 6 
illustrates this process. 

The result of this step is as follows: 

semantic feature of “changing” -> lexicogrammatical feature of “changing” 

Actor -> ga (nominative case marker) 

Goal -> o (accusative case marker) 

The word order: [user (Actor)-ga body (Goal)-o changing] 
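
The two lookups just described (slot to SFL role, then SFL role to case marker) and the final word order can be sketched as follows. This is a minimal Python illustration under simplifying assumptions: the dictionaries and the function linearize are hypothetical stand-ins for the Concept Repository, the Meaning Base and the Wording Base, and the word order (Actor, Goal, Process) is hard-coded rather than read from a realization statement.

```python
# Hypothetical stand-ins for the lookups described in the text.
SLOT_TO_SFL_ROLE = {"agent": "Actor", "object": "Goal"}   # from the Concept Repository
ROLE_TO_CASE_MARKER = {"Actor": "ga", "Goal": "o"}        # from the realization statement

# Assumed word order for this example: Actor, then Goal, then the Process.
SLOT_ORDER = ("agent", "object")


def linearize(frame: dict, process: str) -> str:
    """Build the rough structure of the clause from a conceptual frame."""
    parts = []
    for slot in SLOT_ORDER:
        role = SLOT_TO_SFL_ROLE[slot]
        marker = ROLE_TO_CASE_MARKER[role]
        parts.append(f"{frame[slot]} ({role})-{marker}")
    parts.append(process)
    return "[" + " ".join(parts) + "]"


if __name__ == "__main__":
    print(linearize({"agent": "user", "object": "body"}, "changing"))
```

For the frame of utterance 3 this produces the bracketed structure given above.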
case marker: が (ga)    case marker: を (o) 
[user (Actor)-ga body (Goal)-o changing] 



Fig. 6. Case Markers and Word Order 

Further, we follow any applicable lexicogrammatical constraints, for instance ellipsis 
and thematization. In this example a system network in the Wording Base says that a 
Japanese subject which takes the role of Actor and which is marked with nominative 
case should become the theme of the clause, and that this type of theme should be 
elided by default. So the “user” part is deleted and we do not deal with it any longer. 

Next we determine the fine structure of the Process and Participants. So far the 
rough structure of the text string has been determined, but the details of the text string, 
especially the internal structures of each Participant, Circumstance or Process remain 
unspecified. At this step, we relate the grammatical categories, concepts and, if possi- 
ble, lexical items to the participants, circumstances or process. 




Fig. 7. Relation of Participant “Goal” to Grammatical Categories and Concept Instance 



Table 2. Structure of the Output Sentence 

Part.      Goal                                                 Process 
Gr. Cat.   Head                   Binder    Head   Modifier     Head       Negotiator 
concepts   summer-greeting-card   -         body   -            changing   ask-proposition 
lexicon    -                      の (no)   -      を (o)       -          か (ka) 

The last step is the lexical selection. At this step, we look up the most appropriate 
lexical items for each concept instance in the Situation-Specific Dictionary. We first 
search for candidate lexical items by referring to the concept labels and situation 
types described in the Situation-Specific Dictionary. Then we select the most 
appropriate candidate using the frequency information in the Situation-Specific 
Dictionary. 

For example, for “changing”, the candidate lexical items found are shown in Table 3. 
From the candidates, we choose the most frequent item, “kae-masu”. 



Table 3. Lexical Items for “Changing” 

Index            Conceptual label   Frequency 
aratame-masu     changing           0.2 
henkoo-simasu    changing           0.3 
kae-masu         changing           0.5 
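
The frequency-based choice among the candidates of Table 3 can be sketched in a few lines of Python. The list CANDIDATES and the function select_lexical_item are hypothetical names introduced for illustration; the sketch filters by conceptual label only and leaves out the situation types that the Situation-Specific Dictionary also records.

```python
# Candidate entries taken from Table 3: (lexical item, conceptual label, frequency).
CANDIDATES = [
    ("aratame-masu", "changing", 0.2),
    ("henkoo-simasu", "changing", 0.3),
    ("kae-masu", "changing", 0.5),
]


def select_lexical_item(concept: str, candidates=CANDIDATES) -> str:
    """Return the most frequent lexical item whose label matches the concept."""
    matching = [entry for entry in candidates if entry[1] == concept]
    if not matching:
        raise LookupError(f"no lexical candidate for concept {concept!r}")
    # max() compares on the frequency field only.
    return max(matching, key=lambda entry: entry[2])[0]


if __name__ == "__main__":
    print(select_lexical_item("changing"))
```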



The same procedure is used to find the lexical candidates for “body” and 
“summer-greeting-card”. These are found to be “bunmen” and “syotyuu-mimai”, 
respectively. The final result of the whole text generation process is shown in Table 4. 



Table 4. Output of the Text Generation Process 

Part.      Goal                                             Process 
Gr. Cat.   Head                   Binder   Head     Mod     Head       Negotiator 
concepts   summer-greeting-card   -        body     -       changing   ask-proposition 
lexicon    syootai-zyoo           no       bunmen   o       kae-masu   ka 



4 Conclusion and Future Work 

In this paper, after explaining the basic conceptions of Everyday Language Computing 
and SFL, we showed the contents and overall structure of the Context Base as a 
subcomponent of the Semiotic Base. Then we showed how the Context Base works in 
text generation. The Semiotic Base is context sensitive due to the Context Base. The 
Context Base uses the contextual features in terms of SFL comprehensively and 
explicitly. The supplemental components such as MRDs, the Knowledge Base and the 
Corpus Base make the Context Base work effectively in text processing. We have 
already implemented a prototype system with which a user can operate a word 
processor using his/her everyday language [6]. In this system, the Concept Repository 
for document creation contains about 200 class concepts, and this number is 
expected to grow. 

There is some previous work on text generation adopting SFL. Among these systems, 
COMMUNAL [8] and Multex [9], for example, consider contextual features. 
COMMUNAL is sensitive to Tenor and Mode, which enables the generator to deal 
with some linguistic variations for the same content of generation. However, it 
does not deal with the contextual features comprehensively and does not have a 
semantic stratum as the Semiotic Base has. Multex considers both the contextual and 
semantic contents in terms of SFL. However, the contextual features of this system 
are limited to the level of situation types in terms of SFL. The Semiotic Base is more 
comprehensive in that it considers the values of Field, Tenor and Mode themselves, 
as well as generic structures. 

Several tasks remain. Theoretically, the most significant is how to link the Context 
Base to the other components of the Semiotic Base. The bridge between the Context 
Base and the Meaning Base and Wording Base is currently the Concept Repository and 
the Situation-Specific Dictionary: the linguistic realizations of a given situation are 
represented as concepts and their linguistic representations, as a supplement to the 
main database. Ideally, however, the linguistic realization of context should be 
indicated in the system network in the Context Base, just as the lexicogrammatical 
realizations of the semantic features are indicated in the realization statements in 
the Meaning Base. We are now investigating how to show the linguistic realization of 
a given context directly in the Context Base. 

Another important issue is how to deal with situational knowledge such as KDS. We 
have compiled the situational knowledge manually, and the amount compiled so far is 
extremely small relative to what is needed to deal with all possible actions in the 
given context. We must compile further situational knowledge. At the same time, we 
must consider what kind of knowledge is really needed in the given context so that 
we may avoid an explosion. 

As for the application to text generation, the problem of input specification is the 
most significant. Our present model starts from a conceptual frame as the content of 
generation by which the meaning features are assigned. What is a trigger to this selec- 
tion process? How is the content of generation formed and expressed? We have to 
identify the input for selecting meaning features, a sort of intention, together with its 
representation for the computational model of generation. 
Abstract. We propose a parser based on ideas from the Minimalist 
Programme. The parser supports free word order languages and simulates 
a human listener who necessarily begins sentence analysis before all the 
words in the sentence have become available. We first sketch the 
problems that free word order languages pose. Next we discuss an existing 
framework for minimalist parsing, and show how it is difficult to make 
it work for free word order languages and simulate realistic syntactic 
conditions. We briefly describe a formalism and a parsing algorithm that 
elegantly overcome these difficulties, and we illustrate them with detailed 
examples from Latin, a language whose word order freedom causes it to 
exhibit seemingly difficult discontinuous noun phrase situations. 



1 Introduction 

The Minimalist Programme as described by Chomsky and others [6] seeks to 
provide an explanation for the existence of the capacity of language in humans. 
Syntax merits particular attention in this programme, as it is syntax that me- 
diates the interactions between the requirements of articulation and those of 
meaning. Minimalism characterizes syntax as the simplest possible mapping be- 
tween semantics and phonology. It seeks to define the terms of simplicity and 
determine the structures and processes required to satisfy these terms. 

That the object of investigation is syntax suggests that it is possible to ex- 
tract a formal and computable model of syntactic processes from minimalist 
investigations. Doubly so, as the concept of economy in minimalism can be said 
to correspond to computational complexity. 

In this paper, we look at the problem of developing a minimalist account of 
parsing. We turn our attention in particular to free word order phenomena in 
Latin and to simulating a realistic human parser given that people do not have a 
complete sentence before they begin processing. We first give background on free 
word order phenomena and on minimalist parsing. Next we discuss a formalism 
that performs free word order parsing in such a realistic manner. We show two 
complete parses of representative sentences to demonstrate the algorithm, one 
of which is a case of discontinuous constituency, a special challenge for parsing 
languages such as Latin. 

2 Background 

2.1 Latin and Free Word Order 

Principle-based parsing of free word order languages has been considered for a 
long time — see, for example, [1] — but not much in the context of the Minimalist 
Programme. We propose a minimalist parser and illustrate its operation with 
Latin, a language that exhibits high word order freedom. For example, 

pater laetus amat filium laetum 

father-Nom happy-Nom loves-3Sg son-Acc happy-Acc 
‘The happy father loves the happy son.’ 

For a simple sentence such as this, in theory all 5! = 120 permutations should 
be grammatical. We briefly discuss in section 4 whether this is really the case. 
This is not true of Latin sentences in which function words must be fixed in place. 
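
The figure of 120 can be checked mechanically. The following Python sketch is only an illustration and not part of the parser: it enumerates every ordering of the five words of the example sentence. The function name all_orders is hypothetical, and the count assumes that the five word forms are distinct, as they are here.

```python
from itertools import permutations


def all_orders(sentence: str) -> set:
    """Return every word-order permutation of a sentence as a set of strings."""
    words = sentence.split()
    return {" ".join(order) for order in permutations(words)}


if __name__ == "__main__":
    orders = all_orders("pater laetus amat filium laetum")
    print(len(orders))  # 5! orderings of five distinct words
```

A parser meant to handle free word order should, in principle, accept each member of this set and assign it the same propositional content.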

It is often remarked that these word orders, though semantically equivalent, 
differ in pragmatic import (focus, topic, emphasis, and so on). Existing elaborate 
accounts of contextual effects on Latin word order [3] are rarely sufficiently 
formal for parser development, and do not help a parser designed to extract 
information from a single sentence out of context. It should still be possible to 
extract the propositional content without having to refer to the context; hence, 
we need an algorithm that will parse all 120 orders as though there were no real 
differences between any of them. After all, people can extract information from 
sentences out of context. 

2.2 Derivational Minimalism 

[4] defines a minimalist grammar as $G = (V, Cat, Lex, F)$. $V$ is a set of 
“non-syntactic features” (phonetic and semantic representations). $Cat$ is a set of 
“syntactic features,” $Lex$ is the lexicon (“a set of expressions built from $V$ and 
$Cat$”), and $F$ is “a set of partial functions from tuples of expressions to 
expressions”, that is, structure-building operations such as MOVE and MERGE. MERGE, 
a binary operation, composes words and trees into trees. MOVE removes and reattaches 
subtrees. These manipulations are performed in order to check features. Checking 
ensures, among other things, that words receive required complements. Stabler 
characterizes checking as the cancellation of corresponding syntactic features on 
the participant lexical items, but requires that features be checked in a fixed 
order. 

[5] proposes a minimalist recognizer that uses a CKY-like algorithm to 
determine membership in $L(G)$. Limiting access to features to a particular order 
may not work for free word order languages, where words can often appear in 
any order. Either duplicate lexical entries must proliferate to handle all cases, or 
Stabler’s specification requirements of lexical entries must be relaxed. We choose 
the latter path. 

Stabler’s CKY inference rules do not themselves directly specify the order in 
which items are to be MERGEd and MOVEd into tree structures. This apparent 
nondeterminism is less significant if the feature order is fixed. Since we pro- 
pose a relaxation of the ordering constraint for free word order languages, the 
nondeterminism will necessarily be amplified. 

This dovetails with a practical goal. In free word order languages, semanti- 
cally connected words can appear far apart in a sentence. In 

, and must be merged at some point in a noun-adjective 
relationship. But is the subject of . In order to simulate a human lis- 
tener, the parser would have to merge and first, despite that 
and form one NP; the human listener would hear and make the connection 

between and first . 

Consequently, we propose a parser that simulates a human listener by limiting 
at each step in a derivation what words are accessible to (“have been heard 
by”) the parser and by defining the precedence of the operators in a way that 
fully extracts the syntactic and semantic content of the known words before 
receiving further words. We develop a formalism to allow this while providing the 
flexibility needed for free word order languages. This formalism, illustrated later 
by examples, reflects many of the ideas of Stabler’s formalism, but is otherwise 
independent. 

3 The Parser 

3.1 Lexicalized Grammatical Formalism 

We briefly describe enough of the lexicon structure to assist in understanding 
the subsequent parsing examples. A lexical entry has the form 

$\alpha : \Gamma$ 

$\alpha$ is a word’s phonetic or orthographic representation as required. $\Gamma$ 
is a set of feature structures, henceforth called “feature sets.”^1 $\Gamma$ contains 
feature paths, described below. But more fundamental are the feature bundles from 
which feature paths are constructed. 

^1 There is a third, hidden entity here: the semantic representation of $\alpha$. We 
leave it implied that parsing operations perform semantic composition; the formal 
specification of this is left for future work, but it can be specified in the lambda 
calculus as in [2] or via theta roles, and so on. Nevertheless, this paper is 
ultimately directed towards laying the syntactic groundwork for the extraction of 
semantic content from free word order sentences. 

Feature bundles are required given the fact that highly inflected languages 
often compress several features into a single morpheme. Feature bundles provide 
the means to check them simultaneously. A feature bundle is represented as 
follows: 

$\beta(\tau_1 : \phi_1, \ldots, \tau_n : \phi_n) : \delta, \quad n \geq 1$ 

$\beta$ is the feature checking status. It can be unchecked (unch), UNCHECKED (UNCH), 
checked (ch), CHECKED (CH), unchecked-adjoin (unch+), or checked-adjoin (ch+). When 
feature bundles are checked against one another, $\beta$ can change. The examples 
will illustrate the relation between feature checking status symbols during the 
checking. 

Each $\tau$ is a feature type. It can be one of such things as case, gender, number, 
and so on. Each $\phi$ is a feature value such as Nom (nominative), Sg (singular), 
and so on. The correspondence between feature types and feature values is required 
for the unification aspect of the checking operation, again demonstrated in the 
examples. 

$\delta$ is direction. An item $\iota$ carrying the feature bundle can only check the 
bundle with a bundle on another item to the $\delta$ of $\iota$. $\delta$ can be left 
or right, and $\delta$ is omitted when the direction does not matter (a frequent 
situation in free word order languages). 

A feature path has the form: 

$\pi \rightarrow \Gamma$ 

$\pi$ is a feature bundle and $\Gamma$ is a feature set. A feature path makes 
$\Gamma$ inaccessible for checking until $\pi$ has been checked. $\Gamma$ can be 
empty, in which case the $\rightarrow$ is not written. 

A feature set is simply an unordered list of unique feature paths: 
$\{\eta_1, \ldots, \eta_n\}$. These sets allow each $\eta_i$ to be checked 
independently of the others. In our representation of lexicon entries, we leave out 
the braces if there is only one $\eta$. 
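
One possible way to hold these structures in a program is sketched below in Python. The class names FeatureBundle and FeaturePath, the dictionary LEXICON and the helper accessible_bundles are hypothetical choices made for this illustration; the three sample entries are taken from the parsing examples of Section 3.3. The helper models only the locking behaviour of feature paths (the tail of a path becomes visible once its head bundle is checked) and does not model adjunction to an already checked bundle.

```python
from dataclasses import dataclass, field
from typing import Optional

CHECKED = {"ch", "CH", "ch+"}


@dataclass
class FeatureBundle:
    status: str                      # unch, UNCH, ch, CH, unch+ or ch+
    features: dict                   # feature type -> feature value
    direction: Optional[str] = None  # "left", "right", or None if irrelevant


@dataclass
class FeaturePath:
    bundle: FeatureBundle
    # Feature set (list of FeaturePath) that stays locked until `bundle` is checked.
    rest: list = field(default_factory=list)


# A feature set is represented as a list of feature paths.
LEXICON = {
    "pater": [FeaturePath(FeatureBundle("unch", {"case": "Nom", "num": "Sg", "gnd": "Masc"}))],
    "amat": [
        FeaturePath(FeatureBundle("UNCH", {"case": "Nom", "num": "Sg"})),
        FeaturePath(FeatureBundle("UNCH", {"case": "Acc"})),
    ],
    "a": [
        FeaturePath(
            FeatureBundle("UNCH", {"case": "Abl"}, "right"),
            [FeaturePath(FeatureBundle("unch", {"by": "0"}))],
        )
    ],
}


def accessible_bundles(feature_set: list) -> list:
    """Return the unchecked bundles that are currently open for checking."""
    result = []
    for path in feature_set:
        if path.bundle.status in CHECKED:
            result.extend(accessible_bundles(path.rest))
        else:
            result.append(path.bundle)
    return result
```

In this representation the entry for the preposition exposes only its ablative bundle at first; the by bundle becomes accessible after the ablative bundle has been checked.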

3.2 Data Structures and Operations 

A parse (also known as a derivation) consists of a number of steps. Each step is 
of the form: 

processing buffer | input queue 

The input queue holds the incoming words. It is used to simulate the speaker of a 
sentence. The removal of words is restricted to the left end of the queue. A word 
is shifted onto the right end of the processing buffer, $\Phi$, given conditions 
described below. Shifting is equivalent to “hearing” or “reading” the next word of a 
sentence. $\Phi$ is a list of trees whereon the parser’s operations are performed. 
The initial state of a parse has an empty $\Phi$, and the final successful state has 
an empty input queue and a single tree in $\Phi$ without any unchecked features. A 
parse fails when the input queue is empty, $\Phi$ has multiple trees or unchecked 
feature bundles, and no operations are possible. 

Tree nodes are like lexicon entries, maybe with some features checked. The 
form is [$\alpha$ $\Gamma$] if it is the node with word $\alpha$ closest to the root, 
or $\alpha$ if it is a lower node. A single node is also a tree. 

At each step, the parser can perform one of three operations: MOVE a node 
on a tree to a higher position on the same tree, MERGE two adjacent trees, or 
SHIFT a word from the input queue to the processing buffer in the form of a node 
with the corresponding features from the lexicon. 

MERGE and MOVE are well known in the minimalist literature, but for parsing 
we present them a little differently. Their operation is illustrated in the 
subsequent examples. In this parser, MERGE finds the first^2 two compatible trees 
from the processing buffer with roots $\alpha$ and $\beta$, and replaces them with a 
tree with either $\alpha$ or $\beta$ as the root and the original trees as subtrees. 
MOVE finds a position $\alpha$ in a tree (including the possible future sister of the 
root node) that commands^3 a compatible subtree; the subtree is replaced by a trace 
at its original position and merged with the item at its new position. Adjunct 
movement and specifier movement are possible. Our examples only show adjunct 
movement. MOVE acts on the first tree for which this condition exists. 

There are locality conditions for merge and move. For merge, the trees 
must be adjacent in the processing buffer. For move, the targeted tree positions 
must be the ones as close as possible to the root. 

Compatibility is determined by feature checking. It relies on unification and 
unifiability tests. Sometimes checking succeeds without changing the checking 
status of its participants, because further checking may be required. These as- 
pects of checking are also described in the examples. 

At every step, move is always considered before merge. This is to minimize 
the number of feature-compatible pairs of movement candidates within the trees 
in the processing buffer. If no movement is available in any tree, the parser looks 
for adjacent candidates to merge, starting from left to right. If neither merge 
nor MOVE is possible, a new word is shifted from the input queue. 
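
The precedence just described (MOVE before MERGE before SHIFT) gives the parser a simple control loop, sketched below in Python. This is a skeleton under stated assumptions, not the authors' implementation: the function parse and its callbacks try_move and try_merge are hypothetical, and the callbacks are expected to modify the buffer in place and to return True when they have applied an operation. Feature checking itself is left to the callbacks.

```python
from collections import deque


def parse(words, lexicon, try_move, try_merge):
    """Run the MOVE > MERGE > SHIFT loop and return the final processing buffer.

    try_move(buffer) and try_merge(buffer) are caller-supplied functions that
    perform one operation on the buffer in place and return True on success.
    """
    queue = deque(words)   # incoming words; removal only at the left end
    buffer = []            # processing buffer: a list of trees
    while True:
        if try_move(buffer):
            continue
        if try_merge(buffer):
            continue
        if queue:
            word = queue.popleft()
            buffer.append((word, lexicon[word]))  # SHIFT: "hear" the next word
            continue
        # Nothing applies and no words are left: the parse ends here.
        return buffer


if __name__ == "__main__":
    def no_move(buffer):
        return False

    def merge_last_two(buffer):
        # Toy stand-in for MERGE: joins the two rightmost trees unconditionally.
        if len(buffer) < 2:
            return False
        right = buffer.pop()
        left = buffer.pop()
        buffer.append(("tree", left, right))
        return True

    print(parse(["pater", "amat"], {"pater": "N", "amat": "V"}, no_move, merge_last_two))
```

The caller decides success afterwards: a parse has succeeded when the returned buffer holds a single tree with no unchecked features.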



3.3 Examples of Parsing 

Here is the initial lexicon: 

pater:  unch(case:Nom, num:Sg, gnd:Masc) 
filium: unch(case:Acc, num:Sg, gnd:Masc) 
amat:   {UNCH(case:Nom, num:Sg), UNCH(case:Acc)} 
laetus: unch+(case:Nom, num:Sg, gnd:Masc) 
laetum: unch+(case:Acc, num:Sg, gnd:Masc) 



Inflection in Latin is often ambiguous. More entries for would exist 

in a realistic lexicon. As disambiguation is not in the scope of this paper, we 
assume that the only entries in the lexicon are those useful for our examples. 



Example #1: pater laetus filium amat laetum. This example illustrates the 
basic machinery of the parser, but it also demonstrates how the parser handles the 
discontinuous constituency of phrases, in this case noun phrases. filium ... laetum 
(“the happy son”) is split across the verb. We begin with an empty processing 
buffer: 



| pater laetus filium amat laetum 

^2 We scan from the left. Since the trees closer to the left end of $\Phi$ tend to 
have their features already checked, processing usually affects recently shifted 
items more. 

^3 Node $\xi$ commands node $\psi$ if $\xi$’s sister dominates $\psi$. 



pater and laetus are shifted into the buffer one by one. They are “heard”; the 
lexicon is consulted and the relevant features are attached to them. (For the sake 
of brevity, we will combine multiple steps, particularly when a word is heard and 
some MERGE happens immediately as a result.) 

[pater unch(case:Nom, num:Sg, gnd:Masc)] 
[laetus unch+(case:Nom, num:Sg, gnd:Masc)] 
| filium amat laetum 



laetus adjoins to pater. There is no unifiability conflict between the features 
of both words. If a conflict had occurred, there would be no merge. The lack 
of conflict here indicates that the adjunction is valid. Thus, the feature on 
the adjective is marked as checked. Since adjunction is optional, and further 
adjunctions are theoretically possible, the feature on the noun is not checked 
yet. 

([pater unch(case:Nom, num:Sg, gnd:Masc)] 
  pater 
  [laetus ch+(case:Nom, num:Sg, gnd:Masc)]) 
| filium amat laetum 



We shift filium into the buffer. filium cannot be absorbed by the pater tree. 
So we shift amat. Their nodes look as follows: 

[filium unch(case:Acc, num:Sg, gnd:Masc)] 

[amat {UNCH(case:Nom, num:Sg), UNCH(case:Acc)}] 

Can anything be merged? Yes, filium checks a feature bundle on amat, which 
was the one looking for another compatible bundle in order to project to the 
root of the new tree. When filium’s feature bundle checks with the corresponding 
bundle on amat, several things occur: 

1. amat’s feature bundle’s status changes from UNCH to CH, and filium’s bundle’s 
status changes from unch to ch. 

2. amat projects: a new tree is formed with amat and its features at the root. 
This is specified in the feature bundle: the capitalized form indicates that 
amat is looking for a constituent to fill one of its semantic roles. 

3. amat’s feature bundle is unified with that of filium and replaced with the 
unification result; in other words, it acquires filium’s gender. filium does 
not gain any features, as its bundle is not replaced with the unification 
result. Only the projecting item amat is altered, as it now dominates a 
tree containing filium as a constituent. The non-projecting item becomes 
a subtree, inaccessible for MERGE at the root and thus not needing to reflect 
anything about amat. 
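
The unifiability test and the replacement in step 3 can be illustrated with feature bundles held as plain dictionaries. The Python function unify below is a hypothetical sketch, not the authors' definition: it treats two bundles as unifiable when they agree on every feature type they share, and it returns the combined bundle that would replace the bundle on the projecting item.

```python
def unify(bundle_a: dict, bundle_b: dict):
    """Return the unification of two feature bundles, or None on a conflict."""
    result = dict(bundle_a)
    for feature_type, value in bundle_b.items():
        if feature_type in result and result[feature_type] != value:
            return None  # same feature type, different values: not unifiable
        result[feature_type] = value
    return result


if __name__ == "__main__":
    amat_acc = {"case": "Acc"}
    filium = {"case": "Acc", "num": "Sg", "gnd": "Masc"}
    # The result replaces amat's bundle only; filium's bundle stays as it was.
    print(unify(amat_acc, filium))
```

A conflict, for example between a nominative and an accusative bundle, yields no result, which corresponds to the case in which no merge takes place.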

Table 1 describes the interactions between feature bundle checking status 
types. In all four cases, only the item containing bundle 2 projects and forms 
the root of the new tree. A unifiability test occurs in each checking operation, but 
replacement with the unification result happens only in the feature checked on 
the projecting item (bundle 2) in the first case. No combinations other than these 
are valid for checking. The relation only allows us to check unch with UNCH feature 
bundles, since UNCH bundles indicate that their bearers project; there must be 
exactly one projecting object in any MERGE or MOVE. unch+ bundles check with CH 
feature bundles, because their CH (and feature-compatible) status will have resulted 
from a merge with an item that can accept the adjunct in question. unch+ bundles 
check with any compatible unch or ch feature bundles, as this indicates that the 
target of adjunction has been reached. We consider this analysis of checking 
relations to be exhaustive, but we save the rigorous elimination of other 
combinations for future work. 



Table 1. Checking status interactions 

Before checking         After checking          Replace bundle 2 
Bundle 1   Bundle 2     Bundle 1   Bundle 2     with unif. result? 
unch       UNCH         ch         CH           Y 
unch+      CH           unch+      CH           N 
unch+      unch         ch+        unch         N 
unch+      ch           ch+        ch           N 
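
Table 1 can be read as a lookup keyed on the pair of statuses before checking. The Python sketch below encodes the four rows directly; the names CHECKING_TABLE and check_status are hypothetical and serve only to illustrate the table. Any pair that is not listed is treated as invalid for checking, in line with the text.

```python
# (bundle 1, bundle 2) before checking ->
# (bundle 1 after, bundle 2 after, replace bundle 2 with the unification result?)
CHECKING_TABLE = {
    ("unch", "UNCH"): ("ch", "CH", True),
    ("unch+", "CH"): ("unch+", "CH", False),
    ("unch+", "unch"): ("ch+", "unch", False),
    ("unch+", "ch"): ("ch+", "ch", False),
}


def check_status(bundle1_status: str, bundle2_status: str):
    """Return the row of Table 1 for a status pair, or None if checking is invalid."""
    return CHECKING_TABLE.get((bundle1_status, bundle2_status))


if __name__ == "__main__":
    print(check_status("unch", "UNCH"))
    print(check_status("unch", "unch"))
```

The lookup covers the statuses only; the unifiability test on the feature values has to succeed as well before a row is applied.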



Here is the result of the merge of filium and amat (to save space, we omit 
the pater tree): 

([amat {UNCH(case:Nom, num:Sg), 
        CH(case:Acc, num:Sg, gnd:Masc)}] 
  [filium ch(case:Acc, num:Sg, gnd:Masc)] 
  amat) 
| laetum 



MERGE occurs at the roots of trees. It treats each tree as an encapsulated 
object and does not search for places within trees to put objects. In more 
complicated sentences, searching would require more involved criteria to determine 
whether a particular attachment is valid. For the sake of minimality, we have 
developed a process that does not require such criteria, but only local interaction 
at a surface level. In doing so, we preserve our locality requirements. 

The only attachments to trees that are valid are those that are advertised 
at the root. MOVE within trees takes care of remaining checkable features given 
criteria of minimality described above. 

Now, pater is adjacent to amat and can thus merge with it, checking the 
appropriate bundle. 

([amat {CH(case:Nom, num:Sg, gnd:Masc), 
        CH(case:Acc, num:Sg, gnd:Masc)}] 
  ([pater ch(case:Nom, num:Sg, gnd:Masc)] 
    pater 
    [laetus ch+(case:Nom, num:Sg, gnd:Masc)]) 
  (amat 
    [filium ch(case:Acc, num:Sg, gnd:Masc)] 
    amat)) 
| laetum 
In the processing buffer, no more features can be checked. The system needs 
to process laetum. It moves in and presents a problem: 

[laetum unch+(case:Acc, num:Sg, gnd:Masc)] | 

To what can laetum attach? We have defined the MERGE operation so that it 
cannot search inside a tree; it must operate on objects in the buffer. Fortunately, 
the rules we have defined allow an item with an adjunct feature bundle to be 
merged with another item with a CH projecting feature bundle if both bundles can 
be correctly unified. The adjunct feature remains unch+ until movement causes a 
non-projecting non-adjunct feature to be checked with it. 

([amat {CH(case:Nom, num:Sg, gnd:Masc), 
        CH(case:Acc, num:Sg, gnd:Masc)}] 
  (amat 
    ([pater ch(case:Nom, num:Sg, gnd:Masc)] 
      pater 
      [laetus ch+(case:Nom, num:Sg, gnd:Masc)]) 
    (amat 
      [filium ch(case:Acc, num:Sg, gnd:Masc)] 
      amat)) 
  [laetum unch+(case:Acc, num:Sg, gnd:Masc)]) | 

The rules for movement seek out the highest two positions on the tree that 
can be checked with one another. filium and laetum are precisely that. filium is 
copied, checked, and merged with laetum. For the sake of convention, we mark 
the original position of filium with a trace (<filium>). 

([amat {CH(case:Nom, num:Sg, gnd:Masc), 
        CH(case:Acc, num:Sg, gnd:Masc)}] 
  (amat 
    ([pater ch(case:Nom, num:Sg, gnd:Masc)] 
      pater 
      [laetus ch+(case:Nom, num:Sg, gnd:Masc)]) 
    (amat 
      <filium> 
      amat)) 
  ([filium ch(case:Acc, num:Sg, gnd:Masc)] 
    filium 
    [laetum ch+(case:Acc, num:Sg, gnd:Masc)])) | 

filium dominates laetum because laetum is only an adjunct. All features are now 
checked, and the parse is complete. 

Example #2: pater laetus a filio laeto amatur. This sentence is in the passive 
voice, included to demonstrate the need for feature paths. Passives in Latin are 
very similar to passives in English. a filio laeto is similar to an agent by-phrase. 
This requires new lexicon entries for all the words except for pater and laetus. 
The additional entries: 



a:      UNCH(case:Abl):right -> unch(by:0) 
filio:  unch(case:Abl, num:Sg, gnd:Masc) 
laeto:  unch+(case:Abl, num:Sg, gnd:Masc) 
amatur: {UNCH(case:Nom, num:Sg, gnd:Masc), UNCH(by:0)} 

The by-feature is similar to Niyogi’s [2] solution for by-phrases in English. 0 
is there as a place-holder, since the by-feature does not have multiple values; it 
is either present or absent. We also use 0 to indicate that the feature must be 
present in order for the feature bundle to be unifiable with a corresponding UNCH 
feature bundle. a is a preposition in Latin with other uses (like by in English); 
making it necessarily attach to a verb as the deliverer of an agent requires a 
special feature. 

The by-feature is the second element along a feature path. Before a can be 
merged with a verb, it must first be merged with a complement in the ablative 
case. This complement is directionally specified (to the right of a). 

Let us begin: 

| pater laetus a filio laeto amatur 

pater and laetus are each shifted to the processing buffer. They adjoin: 

([pater unch(case:Nom, num:Sg, gnd:Masc)] 
  pater 
  [laetus ch+(case:Nom, num:Sg, gnd:Masc)]) 
| a filio laeto amatur 



a and filio enter; filio is a noun in the ablative case and can be checked against 
a. (We will henceforth omit the pater tree until it becomes necessary.) 

([a CH(case:Abl, num:Sg, gnd:Masc):right -> unch(by:0)] 
  a 
  [filio ch(case:Abl, num:Sg, gnd:Masc)]) 
| laeto amatur 

filio checks the case feature on a immediately. There will be an adjective that 
needs to adjoin to filio, but it has not yet been heard; meanwhile, the system has 
a preposition and a noun immediately ready to work with. The mechanism of 
unification allows a to advertise the requirements of filio for future adjunction. 
There is no reason for the system to wait for an adjective, as it obviously cannot 
know about it until it has been heard. The noun does not need an adjective and 
only permits one if one is available. 

As before, laeto is absorbed and attached to a. 

([a CH(case:Abl, num:Sg, gnd:Masc):right -> unch(by:0)] 
  (a 
    a 
    [filio ch(case:Abl, num:Sg, gnd:Masc)]) 
  [laeto unch+(case:Abl, num:Sg, gnd:Masc)]) 
| amatur 



This adjunction occurs because unification and replacement have caused a 
to carry the advertisement for a Sg, Masc adjunct in its now-CH feature that 
previously only specified ablative case. 

MOVE connects filio and laeto, leaving <filio>: 

([a CH(case:Abl, num:Sg, gnd:Masc):right -> unch(by:0)] 
  (a 
    a 
    <filio>) 
  ([filio ch(case:Abl, gnd:Masc, num:Sg)] 
    filio 
    [laeto ch+(case:Abl, num:Sg, gnd:Masc)])) 
| amatur 
The buffer now contains two trees (remember the pater tree). Neither of their 
roots has any features that can be checked against the other. amatur is heard: 

[amatur {UNCH(case:Nom, num:Sg, gnd:Masc), UNCH(by:0)}] | 

The case feature was checked on a, so the by-feature is available for checking 
with the verb. Recall that a merge in the processing buffer only occurs between 
adjacent elements. In the next step amatur can only merge with a. 

([amatur {UNCH(case:Nom, num:Sg, gnd:Masc), 
          CH(by:0, case:Abl, num:Sg, gnd:Masc)}] 
  ([a CH(case:Abl, num:Sg, gnd:Masc):right -> ch(by:0)] 
    (a 
      a 
      <filio>) 
    ([filio ch(case:Abl, gnd:Masc, num:Sg)] 
      filio 
      [laeto ch+(case:Abl, num:Sg, gnd:Masc)])) 
  amatur) | 

The by-feature on amatur has been unified with all the features on the feature 
path of a. This would be required if laeto had been heard only after amatur. 
In that situation, amatur would have been merged with a before laeto would 
be heard, since laeto is an adjunct. This mechanism ensures that permission for 
the attachment of an adjunct to filio is exposed at the root of the tree dominated 
by amatur. Here it is not an issue, but the operator covers this possibility. 

Merging with (which we now reintroduce) is the next and final step: 

([amatur {UNCH(case:Nom, num:Sg, gnd:Masc),
          CH(by:0, case:Abl, num:Sg, gnd:Masc)}]
    ([pater unch(case:Nom, num:Sg, gnd:Masc)]
        pater
        [laetus ch+(case:Nom, num:Sg, gnd:Masc)])
    (amatur
        ([a CH(case:Abl, num:Sg, gnd:Masc) :right -> ch(by:0)]
            (a
                <filio>)
            ([filio ch(case:Abl, gnd:Masc, num:Sg)]
                filio
                [laeto ch+(case:Abl, num:Sg, gnd:Masc)]))
        amatur)) |



4 Conclusions and Future Work 

Through examples, we have presented an algorithm and formal framework for 
the parsing of free word order languages given certain limitations: highly con- 
strained operations with strong locality conditions (move and merge), no at- 
tempts at simulating nondeterminism (such as lookahead), and a limitation on 
the availability of the words in the sentence over time (the input queue simu- 
lates a listener processing a sentence as words arrive). Under these limitations, 
we demonstrated the algorithm for a sentence with a discontinuous noun phrase 
and one in the passive voice. 

The precedence of operations serves to connect shifted items semantically as 
soon as possible, though we do not fix the semantic representation; whenever 
MERGE and MOVE are performed, we assume that a more complete semantic 
representation of the sentence is achieved. We will seek compositional semantic 
formalisms that can handle the flexibility of our syntactic formalism. 

The algorithm favours move to ensure that all features in the current set 
of trees in the processing buffer are exhausted. This precludes multiple pairs of 
compatible subtrees, since that can only happen when merge joins a new item 
to a tree without exhausting the available movement candidates. 

This has the effect of creating a “shortest move condition,” which [5] enforces 
by only allowing one instance of a pair of compatible features in a tree. Multiple 
compatible pairs can only arise when a movement frees a feature set along a 
feature path, and the feature bundles in the feature set are each compatible 
with different commanded subtrees. The highest commanded subtree moves first, 
since it is the first found in a search from the tree root. This exception is required 
by the mechanisms we use to handle free word order languages, which Stabler 
does not consider. We conjecture that this may only appear in practice if there 
are actual adjunct attachment ambiguities. 

The locality of merge (adjacent items only) precludes search for every
possible pair of movement candidates on the list. Merge’s precedence over shifting
and locality together have the effect of forcing merge of the most recently shifted
items first, since they bring in new unchecked features. (This fits the intuition
that semantic connections are the easiest among the most recently heard items.)
Shifting last prevents occurrences of multiple compatible candidates for merging.
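
The precedence described above (move first, then merge of adjacent items, and shift only as a last resort) can be pictured as a small control loop. The following Python sketch is only an illustration under strong assumptions: the names `parse`, `no_move`, `toy_merge` and the table `PAIRS` are hypothetical, trees are reduced to plain strings, and feature checking is replaced by a lookup table. It is not the Prolog implementation mentioned in this section.

```python
from typing import Callable, List, Optional

Buffer = List[str]  # stand-in for the processing buffer; real items are trees with features


def parse(queue: List[str],
          try_move: Callable[[Buffer], Optional[Buffer]],
          try_merge: Callable[[Buffer], Optional[Buffer]]) -> Buffer:
    """Apply MOVE while possible, then MERGE, and SHIFT only when neither applies."""
    buffer: Buffer = []
    pending = list(queue)
    while True:
        moved = try_move(buffer)
        if moved is not None:
            buffer = moved
            continue
        merged = try_merge(buffer)
        if merged is not None:
            buffer = merged
            continue
        if not pending:
            return buffer
        buffer.append(pending.pop(0))  # SHIFT: the next word is "heard"


# Hypothetical compatibility table standing in for feature checking.
PAIRS = {("A", "B"): "AB", ("AB", "C"): "ABC"}


def no_move(buffer: Buffer) -> Optional[Buffer]:
    """Toy MOVE that never applies."""
    return None


def toy_merge(buffer: Buffer) -> Optional[Buffer]:
    """Merge one pair of adjacent items, trying the most recently shifted pair first."""
    for i in range(len(buffer) - 1, 0, -1):
        pair = (buffer[i - 1], buffer[i])
        if pair in PAIRS:
            return buffer[:i - 1] + [PAIRS[pair]] + buffer[i + 1:]
    return None


print(parse(["A", "B", "C"], no_move, toy_merge))
print(parse(["A", "C", "B"], no_move, toy_merge))
```

In the second call the items that could combine are never adjacent, so the toy parser ends with three separate items; this mirrors, in a very simplified way, how the adjacency condition can obstruct some word orders.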

We will thoroughly demonstrate these claims in forthcoming work; here, we 
trust the reader’s intuition that the definitions of these operations do prevent 
most ambiguous syntactic states, barring those caused by attachment or lexical 
ambiguities with semantic effect. 

Feature paths are used to force checking certain feature bundles in certain 
orders, usually so that a word’s complements can be found before it is absorbed 
into a tree. Feature sets allow the opposite; free word order languages let most 
features be checked in any order. Unification and unifiability tests provide suf- 
ficient flexibility for split phrases by allowing dominating nodes to advertise 
outstanding requirements of their descendants. 

A less general version of this parsing algorithm, written in Prolog and de- 
signed specifically for these example sentences, showed that 90 of the 120 per- 
mutations of can be parsed. , 

typifies the remaining 30, in that the adjective cannot be 

absorbed by , since has not yet merged with , ; the 30 all have 

similar situations caused by the adjacency condition we impose on merging. 
These sentences are a subset of those exhibiting discontinuous constituency. 

In a further experiment, we allowed implied subjects (for example, omitting 
is grammatical in Latin in example #1). This reduced the number of 
unparsable sentences to 16. was still in all the sentences, but in the 14 

that became parsable, and were originally obstructed from merging. 

We lifted the obstruction by allowing to merge with first. Without 
implied subjects, acted as an obstacle in the same way as ; it ceased 

to be an obstacle after we introduced implied subjects. 

Though we are able to parse most of the sentences in this constrained
environment (and thus most examples of discontinuous constituency), we are working on
determining how to minimally weaken the constraints or extend the algorithm
in order to handle the remaining 16 orders. But before deciding how to modify
the system, we need to determine how many of these orders are actually valid
in real Latin, and thus whether modifications to the system are really justified.
We have embarked on a corpus study to determine whether these orders were
actually plausible in classical Latin; a corpus study is necessitated by the lack
of native speakers. We work with material generously provided by the Perseus
Project (http://www.perseus.tufts.edu/).

This paper discusses the parser as an algorithm. In our pilot implementation, 
we simulated the checking relations between the Latin words that we used to 
experiment with the algorithm. We are now implementing the full parser in SWI 
Prolog using Michael Covington’s GULP package to provide the unification logic. 
We will also seek a way to convert existing comprehensive lexica into a form 
usable by our parser, both for Latin and for other languages. 

Work in minimalist generation and parsing has, thus far, mostly stayed within 
the limits of theoretical linguistics. A parser with the properties that we propose 
would help broaden the scope of the study of minimalist parsing to more realis- 
tic, complex linguistic phenomena. It could take this parsing philosophy toward 
practical applications. An example is speech analysis, where it would be advan- 
tageous to have a parsing algorithm that recognizes the need to make syntactic, 
and thus semantic, links as soon as a word enters the system. 
Abstract. This paper presents a study aiming to find the best strategy to
develop a fast and accurate HMM tagger when only a limited amount of training
material is available. This is a crucial factor when dealing with languages for
which large annotated material is not easily available.

First, we run some experiments in English, using the WSJ corpus as a
testbench to establish the differences caused by the use of a large or a small training set.
Then, we port the results to develop an accurate Spanish PoS tagger using a limited
amount of training data.

Different configurations of a HMM tagger are studied. Namely, trigram and
4-gram models are tested, as well as different smoothing techniques. The
performance of each configuration depending on the size of the training corpus is tested
in order to determine the most appropriate setting to develop HMM PoS taggers
for languages with a reduced amount of corpus material available.



1 Introduction 

PoS tagging is needed by most natural language applications, such as summarization,
machine translation, and dialogue systems, and is the basis of many higher-level NLP
processing tasks. It is also used to obtain annotated corpora by combining automatic tagging
with human supervision. These corpora may be used for linguistic research, to build better
taggers, or as statistical evidence for many other language-processing goals.

PoS tagging has been widely studied and many systems have been developed. There are some
statistical implementations [5, 6, 11, 13, 12, 1] and some knowledge-based taggers
(finite-state, rule-based, memory-based) [8, 2, 7]. There are also some systems that combine
different implementations with a voting procedure.

This work presents a thorough study aiming to establish the most appropriate
way to train a HMM PoS tagger when dealing with languages with a limited amount
of training corpora. To do so, we compare different smoothing techniques and HMMs of
different orders.

Experiments are performed to determine the performance of the best configuration
when the tagger is trained with a large English corpus (1 million words from WSJ),
and the results are compared with those for a training corpus ten times smaller. Then,
the experiments for the small training corpus are repeated in another language (Spanish),
validating the conclusions. The tested HMM configurations vary in the order of the
model (HMMs of order 3 and 4 are tested) and in the smoothing technique (Lidstone’s law
vs. Linear Interpolation) used to estimate model parameters.

Section 2 presents the theoretical basis of a HMM and the different smoothing
techniques used in this work. Section 3 describes the experiments performed and the
results obtained. Section 4 states some conclusions and further work.



2 Hidden Markov Models 

We will be using Hidden Markov Model Part-of-Speech taggers of order three and four.
Depending on the order of the model, the states represent pairs or triples of tags, so
the number of parameters to estimate varies widely. As the states are pairs
or triples of tags, the number of possible states is the number of possible combinations of all the
tags in groups of two or three. Table 1 shows the number of tags for each language and
the resulting number of potential states. This number is very large, but many
states will never be observed. The emitted symbols are words, which we estimated
to number about 100,000 for English and 1,000,000 for Spanish.

Table 1. Number of potential tags and states for English and Spanish with both models

                                               English    Spanish
Number of tags                                      47         67
Number of potential states in a 3-gram HMM       2,209      4,489
Number of potential states in a 4-gram HMM     103,823    300,763
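
The figures in Table 1 follow directly from the tag set sizes: a state of an n-gram HMM is a tuple of n - 1 tags. A minimal Python check, where the function name `potential_states` is only illustrative:

```python
def potential_states(num_tags: int, order: int) -> int:
    """Number of potential states when each state is a tuple of (order - 1) tags."""
    return num_tags ** (order - 1)


# Tag set sizes as given in Table 1.
for language, num_tags in (("English", 47), ("Spanish", 67)):
    print(language,
          potential_states(num_tags, 3),   # 3-gram HMM: pairs of tags
          potential_states(num_tags, 4))   # 4-gram HMM: triples of tags
```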



The parameters of such models are initial state probabilities, state transition proba- 
bilities and emission probabilities. That is: 

$$\pi_i = P(q_1 = s_i)$$

is the probability that a sequence starts at state $s_i$,

$$a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i)$$

is the transition probability from state $i$ to state $j$ (i.e. trigram probability $P(t_3 \mid t_1 t_2)$ in
a third-order model, or 4-gram probability $P(t_4 \mid t_1 t_2 t_3)$ in a 4-gram HMM), and

$$b_i(k) = P(w_k \mid s_i)$$

is the emission probability of the symbol $w_k$ from state $s_i$.

In the PoS task model, the emitted symbols are the observed words and the observed 
sequence is a complete sentence. Given a sentence, we want to choose the most likely 
sequence of states (i.e. PoS tags) that generated it. This is computed using the Viterbi 
algorithm [14]. 
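
As an illustration of this decoding step, the following Python sketch implements the standard Viterbi recursion in log space. It is a simplified sketch, not the tagger used in the experiments: states are treated as opaque labels (in the taggers discussed here they would be pairs or triples of tags), and the dictionaries `pi`, `a` and `b`, as well as the toy values below, are assumptions made only for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def viterbi(words: Sequence[str],
            states: Sequence[str],
            pi: Dict[str, float],
            a: Dict[Tuple[str, str], float],
            b: Dict[Tuple[str, str], float]) -> List[str]:
    """Most likely state sequence for a non-empty word sequence.

    pi[s], a[(s, s2)] and b[(s, w)] are probabilities; missing entries count as 0.
    """
    def log(p: float) -> float:
        return math.log(p) if p > 0 else float("-inf")

    # delta[s]: best log-probability of any path that ends in state s
    delta = {s: log(pi.get(s, 0.0)) + log(b.get((s, words[0]), 0.0)) for s in states}
    backpointers = []
    for w in words[1:]:
        new_delta, pointers = {}, {}
        for s2 in states:
            prev = max(states, key=lambda s: delta[s] + log(a.get((s, s2), 0.0)))
            new_delta[s2] = (delta[prev] + log(a.get((prev, s2), 0.0))
                             + log(b.get((s2, w), 0.0)))
            pointers[s2] = prev
        delta = new_delta
        backpointers.append(pointers)

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: delta[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Toy model with two states; all numbers are invented for illustration.
toy_states = ["N", "V"]
toy_pi = {"N": 0.8, "V": 0.2}
toy_a = {("N", "N"): 0.3, ("N", "V"): 0.7, ("V", "N"): 0.6, ("V", "V"): 0.4}
toy_b = {("N", "dogs"): 0.6, ("V", "dogs"): 0.1, ("N", "bark"): 0.1, ("V", "bark"): 0.7}
print(viterbi(["dogs", "bark"], toy_states, toy_pi, toy_a, toy_b))
```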
2.1 Parameter Estimation 

The simplest way to estimate the HMM parameters is Maximum Likelihood Estimation 
(MLE), which consists in computing observation relative frequencies: 



$$P_{MLE}(x) = \frac{count(x)}{N} \qquad P_{MLE}(x \mid y) = \frac{count(x, y)}{count(y)}$$

For the case of the HMM we have to compute this probability estimation for each
initial state, transition or emission in the training data:



$$\pi_i = \frac{count(s_i(t = 0))}{count(sentences)}$$

$$a_{ij} = \frac{count(s_i \to s_j)}{count(s_i)}$$

$$b_i(k) = \frac{count(s_i, w_k)}{count(s_i)}$$

where $count(s_i(t = 0))$ is the number of times that $s_i$ is visited as an initial state,
$count(sentences)$ is the number of sentences, $count(s_i \to s_j)$ is the count of the
transitions from state $s_i$ to $s_j$, $count(s_i)$ is the number of visits to state $s_i$ and $count(s_i, w_k)$
is the number of times that symbol $w_k$ is emitted from state $s_i$.
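
The relative-frequency estimates above amount to a few counters over a tagged corpus. The sketch below is a simplification made for illustration: each tag is treated as a state (rather than a pair or triple of tags), sentences are assumed to be non-empty, and the function name `mle_parameters` and the toy corpus are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]  # list of (word, tag) pairs


def mle_parameters(tagged_sentences: List[TaggedSentence]):
    """Relative-frequency estimates of pi, a and b, with one state per tag."""
    initial, transitions, emissions, visits = Counter(), Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        states = [tag for _, tag in sentence]
        initial[states[0]] += 1
        for word, tag in sentence:
            visits[tag] += 1
            emissions[(tag, word)] += 1
        for s, s2 in zip(states, states[1:]):
            transitions[(s, s2)] += 1

    n_sentences = len(tagged_sentences)
    pi = {s: c / n_sentences for s, c in initial.items()}
    a = {(s, s2): c / visits[s] for (s, s2), c in transitions.items()}
    b = {(s, w): c / visits[s] for (s, w), c in emissions.items()}
    return pi, a, b


toy_corpus = [
    [("the", "D"), ("dog", "N")],
    [("the", "D"), ("cat", "N")],
    [("dogs", "N")],
]
toy_pi, toy_a, toy_b = mle_parameters(toy_corpus)
print(toy_pi, toy_a, toy_b)
```

Any event that never occurs in the corpus is simply absent from these dictionaries, that is, it has probability zero.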

Actually, computing $b_i(k)$ in this way is quite difficult because the number of
occurrences of a single word will be too small to provide enough statistical evidence, so
Bayes rule is used to compute $b_i(k)$ as:

$$b_i(k) = P(w_k \mid s_i) = \frac{P(s_i \mid w_k)\,P(w_k)}{P(s_i)}$$

where:

$$P(s_i) = \frac{count(s_i)}{count(words)} \qquad P(w_k) = \frac{count(w_k)}{count(words)}$$

with $count(words)$ being the number of words in the training corpus.

Since $P(s_i \mid w_k)$ would also require lots of data to be properly estimated, we
approximate it as $P(t \mid w_k)$, where $t$ is the last tag in the n-gram corresponding to the state.
Similarly, $P(s_i)$ is approximated as $P(t)$.



2.2 Smoothing 

MLE is usually a bad estimator for NLP purposes, since data tends to be sparse*. This
leads to zero probabilities being assigned to unseen events, causing trouble when
multiplying probabilities.



* Following Zipf’s law: a word’s frequency is inversely proportional to its rank order. 
To solve this sparseness problem it is necessary to look for estimators that assign 
a part of the probability mass to the unseen events. To do so, there are many differ- 
ent smoothing techniques, all of them consisting of decreasing the probability assigned 
to the seen events and distributing the remaining mass among the unseen events. In 
this work two smoothing methods are compared: Lidstone’s law and Linear Interpola- 
tion. 



Laplace and Lidstone’s Law. The oldest smoothing technique is Laplace’s law [9], 
that consists in adding one to all the observations. That means that all the unseen events 
will have their probability computed as if they had appeared once in the training data. 
Since one observation for each event (seen or unseen) is added, the number of different 
possible observations (B) has to be added to the number of real observations (N), in 
order to maintain the probability normalised. 

$$P_{La}(x) = \frac{count(x) + 1}{N + B}$$

However, if the space is large and very sparse (and thus the number of possible events
$B$ may be similar to, or even larger than, the number of observed events), Laplace’s
law gives the unseen events too much probability mass.

A possible alternative is Lidstone’s law (see [10] for a detailed explanation of these
and other smoothing techniques), which generalises Laplace’s law and allows adding an
arbitrary value to unseen events. So, for a relatively large number of unseen events, we
can choose to add values lower than 1. For a relatively small number of unseen events,
we may choose 1, or even larger values, if we have a large number of observations ($N$).



$$P_{Ld}(x) = \frac{count(x) + \lambda}{N + B\lambda}, \qquad \lambda > 0$$
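
A small numeric sketch may help to see how the added value redistributes probability mass. The function name `lidstone` and the toy counts are assumptions made for this example; with the added value set to 1 the function reduces to Laplace’s law.

```python
def lidstone(count: float, n: float, b: int, lam: float) -> float:
    """Lidstone estimate: (count + lam) / (n + b * lam), with lam > 0."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return (count + lam) / (n + b * lam)


# Toy setting: N = 10 observations over B = 5 possible events, three of them unseen.
toy_counts = {"a": 6, "b": 4, "c": 0, "d": 0, "e": 0}
n_obs = sum(toy_counts.values())
n_events = len(toy_counts)

for lam in (1.0, 0.5, 0.1):
    probs = {x: lidstone(c, n_obs, n_events, lam) for x, c in toy_counts.items()}
    print(lam, round(probs["a"], 4), round(probs["c"], 4), round(sum(probs.values()), 4))
```

Smaller added values leave more mass on the seen events, while the estimates still sum to one over the $B$ possible events.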



To use Laplace’s or Lidstone’s laws in a HMM-based tagger we have to smooth all 
probabilities involved in the model: 



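$$\pi_i = \frac{count(s_i(t = 0)) + \lambda_\pi}{count(sentences) + B_{tag}\lambda_\pi}$$

$$a_{ij} = \frac{count(s_i \to s_j) + \lambda_A}{count(s_i) + B_{tag}\lambda_A}$$

$$P(s_i) = \frac{count(s_i) + \lambda_s}{count(words) + B_{tag}\lambda_s}$$

$$P(w_k) = \frac{count(w_k) + \lambda_w}{count(words) + B_w\lambda_w}$$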



where $B_{tag}$ is the number of possible tags and $B_w$ is the number of words in the
vocabulary (obviously, we can only approximate this quantity).

Since there are different counts involved in each probability, we have to consider
different $\lambda$ values for each formula. In the case of Laplace’s law, all $\lambda$ are set to 1, but
when using Lidstone’s, we want to determine which is the best set of $\lambda$, and how they
vary depending on the training set size, as discussed in section 3.1.
Linear Interpolation. A more sophisticated smoothing technique consists of linearly
combining the estimations for different order n-grams:

$$P_{li}(t_3 \mid t_1 t_2) = c_1 P(t_3) + c_2 P(t_3 \mid t_2) + c_3 P(t_3 \mid t_1 t_2)$$

$$P_{li}(t_4 \mid t_1 t_2 t_3) = c_1 P(t_4) + c_2 P(t_4 \mid t_3) + c_3 P(t_4 \mid t_2 t_3) + c_4 P(t_4 \mid t_1 t_2 t_3)$$

where $\sum_i c_i = 1$ to normalise the probability. Although the values for $c_i$ can be
determined in many different ways, in this work they are estimated by deleted interpolation as
described in [1]. This technique assumes that the $c_i$ values don’t depend on the particular
n-gram and computes the weights depending on the counts for each i-gram involved in
the interpolation.
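
To make the idea concrete, the following Python sketch shows one widely used formulation of deleted interpolation for the trigram case: each observed trigram votes, with its frequency, for the n-gram order whose "leave-one-out" relative frequency is highest. Whether this matches every detail of the procedure in [1] is an assumption; the function name `deleted_interpolation`, the tie-breaking rule and the toy tag sequences are ours.

```python
from collections import Counter
from typing import List, Sequence


def deleted_interpolation(tag_sequences: Sequence[Sequence[str]]) -> List[float]:
    """Estimate [c1, c2, c3] for unigram, bigram and trigram estimates."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    n = sum(uni.values())

    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator > 0 else 0.0

    weights = [0.0, 0.0, 0.0]
    for (t1, t2, t3), f in tri.items():
        # Each count is decremented by one, as if the trigram had been deleted.
        candidates = [
            ratio(uni[t3] - 1, n - 1),
            ratio(bi[(t2, t3)] - 1, uni[t2] - 1),
            ratio(f - 1, bi[(t1, t2)] - 1),
        ]
        # Ties go to the lowest order (an arbitrary choice in this sketch).
        weights[candidates.index(max(candidates))] += f

    total = sum(weights)
    if total == 0:
        return [1 / 3, 1 / 3, 1 / 3]  # no trigram evidence: fall back to uniform
    return [w / total for w in weights]


toy_tags = [["D", "N", "V", "D", "N"], ["D", "N", "V"]]
coefficients = deleted_interpolation(toy_tags)
print(coefficients)
```

The returned coefficients are non-negative and sum to one, so they can be used directly in the interpolated trigram estimate; the 4-gram case adds a fourth candidate in the same way.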



3 Experiments and Results 

The main purpose of this work is to study the behaviour of different configurations for 
a HMM-based PoS tagger, in order to determine the best choice to develop taggers for 
languages with scarce annotated corpora available. 

First, we will explore different configurations when a large training corpus
is available. The experiments will be performed on English, using 1.1 million words from
the Wall Street Journal corpus. Then, the same configurations will be explored when the
training set is reduced to 100,000 words.

Later, the behaviour on the reduced train set will be validated on a manually developed 
100,000 word corpus for Spanish [4]. 

The tested configurations vary on the order of the used HMM (trigram or 4-gram), 
and on the smoothing applied (Lidstone’s law or Linear Interpolation). 

All the experiments are done using ten fold cross-validation. In each fold, 90% of 
the corpus is used to train the tagger and the rest to test it. 
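
A ten-fold split of this kind can be sketched as follows. How the folds were actually drawn is not stated here, so the interleaved assignment of sentences to folds and the name `ten_fold_splits` are assumptions of this example.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def ten_fold_splits(sentences: Sequence[T], folds: int = 10) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs; every sentence is used for testing exactly once."""
    for k in range(folds):
        test = [s for i, s in enumerate(sentences) if i % folds == k]
        train = [s for i, s in enumerate(sentences) if i % folds != k]
        yield train, test


toy_sentences = list(range(100))  # stand-ins for tagged sentences
for train, test in ten_fold_splits(toy_sentences):
    print(len(train), len(test))
```

The reported precision for a configuration would then be the average over the ten test folds.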

In the following sections, we present the behaviour of the different HMM
configurations for English, both with a large and a small corpus. Afterwards, we repeat the tests using
a small corpus for Spanish.

3.1 Applying Lidstone’s Law 

As explained in section 2.2, when Lidstone’s law is used in a HMM tagger, there
are four $\lambda$ values to consider. Changing these values significantly affects the precision
of the system.

Thus, before comparing this smoothing technique with another, we have to select the
set of $\lambda$ that yields the best tagger performance. After performing some experiments,
we observed that $\lambda_A$ is the only parameter that significantly affects the behaviour of
the system. Modifying the other three values didn’t change the system precision in a
significant way. So $\lambda_\pi$, $\lambda_s$, and $\lambda_w$ were set to some values determined as follows:

- $\lambda_\pi$ is the assigned count for unobserved initial states. Since initial states depend
only on the tag of the first word in the sentence, and the tag set we are using is quite
reduced (about 40 tags), we may consider that in a 1,000,000 word corpus, at least
one sentence will start with each tag. So, we will count one occurrence for unseen
events (i.e. we are using $\lambda_\pi = 1$, Laplace’s law, for this case). When the corpus is
ten times smaller, we will use a proportional rate of occurrence ($\lambda_\pi = 0.1$).

- $\lambda_s$ is the count assigned to the unseen states. Since we approximate $P(s_i)$ by $P(t)$
(see section 2.1), the possible events are the number of tags in the tag set, and we
can reason as above, assuming at least one occurrence of each tag in a 1,000,000
word corpus (again, Laplace’s law, $\lambda_s = 1$), and a proportional value for the small
corpus ($\lambda_s = 0.1$).

- $\lambda_w$ is the count assigned to the unseen words. Obviously, enforcing that each possible
word will appear at least once would take too much probability mass out of seen
events (English vocabulary is about 100,000 forms, which would represent 10% of
a 1 million word corpus), so we adopt a more conservative value: $\lambda_w = 0.1$ for the
large corpus, and the proportional value $\lambda_w = 0.01$ for the small one.

After setting these three $\lambda$ values, we have to select the best value for $\lambda_A$. To
diminish the risk of getting stuck in local maxima, we will repeatedly use hill-climbing with different
starting values and step lengths ($\Delta\lambda$), and choose the value that produces the best
results.
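
The search can be sketched as a one-dimensional hill-climbing routine restarted from several (starting value, step) pairs. In the sketch below, `evaluate` stands for training and testing a tagger with a given transition smoothing value; here it is replaced by an invented toy function, and the names `hill_climb` and `multi_start` are hypothetical. The start and step pairs are the ones listed in Tables 2 and 3.

```python
from typing import Callable, Iterable, Tuple


def hill_climb(evaluate: Callable[[float], float],
               start: float, step: float, lower: float = 1e-9) -> Tuple[float, float]:
    """Move by +/- step while the score improves; return (best value, best score)."""
    best_x, best_score = start, evaluate(start)
    while True:
        neighbours = [x for x in (best_x - step, best_x + step) if x >= lower]
        score, x = max((evaluate(x), x) for x in neighbours)
        if score <= best_score:
            return best_x, best_score
        best_x, best_score = x, score


def multi_start(evaluate: Callable[[float], float],
                starts_and_steps: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Run hill-climbing from several starting points and keep the best result."""
    return max((hill_climb(evaluate, s, d) for s, d in starts_and_steps),
               key=lambda result: result[1])


def toy_precision(lam: float) -> float:
    """Invented stand-in for tagger precision, peaking at lam = 0.6."""
    return -(lam - 0.6) ** 2


grid = [(0.05, 0.01), (0.05, 0.005), (0.5, 0.1), (0.5, 0.05), (1.0, 0.5), (1.0, 0.1)]
best_lam, best_score = multi_start(toy_precision, grid)
print(round(best_lam, 3))
```

With a coarse step the search can stop away from the optimum (for instance, starting at 1.0 with step 0.5 in the toy function), which is why several starting points and step lengths are combined.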

Table 2 presents the results of this hill-climbing algorithm using the whole English corpus
(1 million words). Table 3 shows the same results for the 100,000-word
English corpus.



Table 2. Precision obtained applying hill-climbing on the complete English corpus

                    Trigram HMM                        4-gram HMM
Initial             Initial     Final      Final       Initial     Final      Final
λ_A       Δλ        precision   precision  λ_A         precision   precision  λ_A
0.05      0.01      96.98       97.00      0.22        96.72       96.88      0.28
0.05      0.005     96.98       96.99      0.085       96.72       96.81      0.125
0.5       0.1       97.008      97.009     0.6         96.91       96.93      1.0
0.5       0.05      97.008      97.009     0.4         96.91       96.93      0.95
1.0       0.5       97.00       97.01      0.5         96.93       96.94      1.5
1.0       0.1       97.00       97.01      0.6         96.93       96.93      1.0



Table 3. Precision obtained applying hill-climbing on the reduced English corpus

                    Trigram HMM                        4-gram HMM
Initial             Initial     Final      Final       Initial     Final      Final
λ_A       Δλ        precision   precision  λ_A         precision   precision  λ_A
0.05      0.01      96.56       96.63      0.09        95.79       96.30      0.33
0.05      0.005     96.56       96.63      0.1         95.79       96.20      0.2
0.5       0.1       96.69       96.69      0.5         96.36       96.43      0.8
0.5       0.05      96.69       96.69      0.5         96.36       96.43      0.75
1.0       0.5       96.70       96.70      1.0         96.44       96.51      3.5
1.0       0.1       96.70       96.71      0.9         96.44       96.46      1.2

As expected, when using a small corpus the precision falls, especially when
a 4-gram HMM is used, since the evidence to estimate the model is insufficient. This
point is discussed in section 3.3.

These results show that the value selected for $\lambda_A$ is an important factor when using
this smoothing technique. As can be seen in table 3, the precision of the tagger varies by up
to 0.7% depending on the value used for $\lambda_A$.

After performing the hill-climbing search, we choose the $\lambda_A$ that gives the best results
in each case as the optimal parameter to use with this smoothing technique. So, for the
whole corpus using a trigram HMM, $\lambda_A$ is set to 0.6 and the tagger yields a precision of
97.01%, while if we use a 4-gram HMM, $\lambda_A = 1.5$ leads to a precision of 96.94%. When
the experiments are performed over the reduced corpus, the best results are obtained with
$\lambda_A = 0.9$ for a trigram HMM (96.71%) and with $\lambda_A = 3.5$ for a 4-gram model (96.51%).

3.2 Applying Linear Interpolation 

The performance of the taggers when using Linear Interpolation to smooth the probability 
estimations has been also tested. In this case, the coefficients Ci are found via the deleted 
interpolation algorithm (see section 2.2). 

When using Linear Interpolation, the precision obtained by the system with the 
whole corpus is 97.00% with a trigram HMM, and 97.02% with a 4-gram HMM. For the 
reduced corpus the precision falls slightly and we obtain 96.84% for the trigram model 
and 96.71% for the 4-gram HMM. 

The results obtained using Linear Interpolation and a trigram model should
reproduce those reported by [1], where the maximum precision reached by the system on
WSJ is 96.7%. In our case we obtain a higher precision because we are assuming the
nonexistence of unknown words (i.e. the dictionary contains all possible tags for all
words appearing in the test set; obviously, word-tag frequency information from the test
corpus is not used when computing $P(s_i \mid w_k)$).

3.3 Best Configuration for English 

Best results obtained for each HMM tagger configuration are summarized in table 4. 
Results are given both for the large and small corpus. 

Comparing the results for the two smoothing methods used with different order 
models, we can draw the following conclusions: 

- In general, Linear Interpolation produces taggers with higher precision than
Lidstone’s law.

- For the large corpus, the results are not significantly different for any
combination of n-gram order and smoothing technique, while for the reduced
corpus it is clearly better to use a trigram model than a 4-gram HMM, and Linear
Interpolation yields slightly better results.

- Using Linear Interpolation has the benefit that the coefficients involved are
computed from the training data via deleted interpolation, while for Lidstone’s law the
precision is very dependent on the $\lambda_A$ value, which requires costly optimisation (e.g.
via hill-climbing).







M. Padro and L. Padro 



Table 4. Results obtained for all HMM PoS tagger configurations using large and small sections
of the WSJ corpus

1.1 Mword English corpus
             Lidstone’s law    Linear Interpolation
trigram      97.01%            97.00%
4-gram       96.94%            97.02%

100 Kword English corpus
             Lidstone’s law    Linear Interpolation
trigram      96.71%            96.84%
4-gram       96.51%            96.71%



3.4 Behaviour in Spanish 

The same experiments performed for English were performed with a Spanish corpus 
(CLiC-TALP Corpus^) which has about 100,000 words. This corpus is manually vali- 
dated so, although it is small, it is more accurately tagged than WSJ. 

In this case the tagger relies on the FreeLing morphological analyser [3] instead of
using a dictionary built from the corpus. Nevertheless, the situation is comparable to the
English experiments above: since the corpus and the morphological analyser have been
hand-developed and cross-checked, they are mutually consistent, and so we don’t have
to worry about unknown words in the test corpus.

Applying Lidstone’s Law. In the same way as for the English corpus, a hill-climbing
search is performed to study the influence of the $\lambda_A$ value on the precision of the
system. The $\lambda_\pi$, $\lambda_s$ and $\lambda_w$ values are fixed to the same values used for the reduced
WSJ.

Table 5. Precision obtained with the hill-climbing algorithm on the Spanish corpus

                    Trigram HMM                        4-gram HMM
Initial             Initial     Final      Final       Initial     Final      Final
λ_A       Δλ        precision   precision  λ_A         precision   precision  λ_A
0.05      0.01      96.54       96.68      0.18        95.49       96.00      0.35
0.05      0.005     96.54       96.58      0.065       95.49       95.82      0.15
0.5       0.1       96.79       96.80      0.5         96.06       96.16      1.6
0.5       0.05      96.79       96.81      0.55        96.06       96.14      1.05
1.0       0.5       96.80       96.85      1.5         96.14       96.22      2.5
1.0       0.1       96.80       96.84      1.1         96.14       96.16      1.6



Table 5 shows the results of these experiments. The best $\lambda_A$ for the trigram HMM
is $\lambda_A = 1.5$, yielding a precision of 96.85%. The best value for a 4-gram model is
$\lambda_A = 2.5$, which produces a precision of 96.22%.



^ More information at http://www.lsi.upc.es/~nlp/
Applying Linear Interpolation. The coefficients for Linear Interpolation are computed
for Spanish in the same way as for English (section 3.2). The precision of the resulting
taggers is 96.90% for the trigram HMM and 96.73% for the 4-gram model.

Best Configuration for Spanish. Results for Spanish are, as may be expected,
similar to those obtained with the reduced English corpus. Again, working with a trigram
HMM gives higher precision than working with a 4-gram one, for both smoothing
techniques, and using Linear Interpolation gives slightly better results than using
Lidstone’s law. Table 6 summarizes the results obtained for both smoothing methods.



Table 6. Results obtained for all HMM PoS tagger configurations using the Spanish 100 Kword
corpus

100 Kword Spanish corpus
             Lidstone’s law    Linear Interpolation
trigram      96.85%            96.90%
4-gram       96.22%            96.73%



Nevertheless, some important remarks can be extracted from these results: 

- Competitive HMM taggers may be built using relatively small training sets, which is
interesting for languages lacking large resources.

- The best results are obtained using trigram models and Linear Interpolation.

- Lidstone’s law may be used as an alternative smoothing technique, but if $\lambda_A$ is not
tuned, results are likely to be significantly lower.



4 Conclusions and Further Work 

We have studied how competitive HMM-based PoS taggers can be developed using
relatively small training corpora.

Results show that accurate taggers can be built provided the appropriate
smoothing techniques are used. Of the two techniques studied here, the one that generally
gives the higher precision is Linear Interpolation, but Lidstone’s law can reach, in many
cases, similar precision rates if a search is performed through the parameter space to
find the most appropriate $\lambda_A$.

The model proposed in [1] (trigram tagger, Linear Interpolation smoothing) is not
only the most suitable for large training corpora, but also gives the best results for limited
amounts of training data.

The use of 4-gram models may result in a slight increase in precision when using
large corpora. Nevertheless, the gain is probably not worth the increase in complexity
and size of the model.

Further work includes dealing with unknown words and studying their
influence on taggers developed on small corpora. Also, we plan to port the same
experiments to other languages (namely, Catalan) to further validate the conclusions of
this paper.
Abstract. When automatically translating between related languages, 
one of the main sources of machine translation errors is the incorrect 
resolution of part-of-speech (PoS) ambiguities. Hidden Markov models 
(HMM) are the standard statistical approach to try to properly resolve 
such ambiguities. The usual training algorithms collect statistics from 
source-language texts in order to adjust the parameters of the HMM, 
but if the HMM is to be embedded in a machine translation system, 
target-language information may also prove valuable. We study how to 
use a target-language model (in addition to source-language texts) to im- 
prove the tagging and translation performance of a statistical PoS tagger 
of an otherwise rule-based, shallow-transfer machine translation engine, 
although other architectures may be considered as well. The method may 
also be used to customize the machine translation engine to a particular 
target language, text type, or subject, or to statistically “retune” it after 
introducing new transfer rules. 



1 Introduction 

One of the main sources of errors in machine translation (MT) systems between 
related languages is the incorrect resolution of part-of-speech (PoS) ambiguities. 
Hidden Markov models (HMMs) [9] are the standard statistical approach [3] to 
automatic PoS tagging. Typically the training of this kind of tagger has been
carried out from source-language (SL) untagged corpora (see below) using the
Baum-Welch algorithm [9].

BES-2004-4711. We thank Rafael C. Carrasco for useful comments on this work. We
also thank Geoffrey Sampson (University of Sussex, England) for his Simple
Good-Turing implementation.

But target-language (TL) texts may also be taken into account in order to
improve the performance of these PoS taggers, especially as to the resulting
translation quality, an aspect not addressed by training algorithms which use information
from SL only. We propose a training method for HMMs which considers the
likelihood in the TL of the translation of each of the multiple disambiguations
of a source text which can be produced depending on how its PoS ambiguity
is resolved. To achieve this goal, these steps are followed: first, the SL text is
segmented; then, the set of all possible disambiguations for each segment is
generated; after that, each disambiguation is translated into TL; next, a TL statistical
model is used to compute the likelihood of each translated disambiguation of the
segment; and, finally, these likelihoods are used to adjust the parameters of the
SL HMM: the higher the likelihood, the higher the probability of the original
SL tag sequence in the model being trained. Rules for text segmentation must
be carefully chosen so that the resulting segments are treated independently by
the rest of the modules in the MT system.

One of the main obstacles to overcome is the presence of free rides, that is,
ambiguous SL words which are translated into the same TL word for every
possible disambiguation; the ambiguity therefore remains intact in the TL and
no TL information can be used for disambiguation purposes. This is especially
harmful in the case of related languages, where free rides are very common.

Most current MT systems follow the indirect or transfer approach [5, ch. 4]:
the SL text is analysed and converted into an intermediate representation which
becomes the basis for generating the corresponding TL text. Analysis modules
usually include a PoS tagger for the SL. Our method for training PoS taggers may
be applied, in principle, to any variant of an indirect architecture which uses or
may use a HMM-based PoS tagger. In particular, a MT system using a classical
transfer architecture will be considered in the experiments.

We will refer to a text as unambiguously tagged, or just tagged, when each
occurrence of each word (ambiguous or not) has been assigned the correct PoS
tag. An ambiguously tagged text or corpus is one in which all words are
assigned (using a morphological analyser) the set of possible PoS tags
independently of context; in this case, ambiguous and unknown words receive
more than one PoS tag (unknown words, that is, words not found in the lexicon,
are usually assigned the set of open categories, that is, categories which are
likely to grow by new words of the language: nouns, verbs, adjectives, adverbs
and proper nouns). Words receiving the same set of PoS tags are said to belong
to the same ambiguity class [3]; for example, all words that can be either a
noun or a verb belong to the ambiguity class {noun, verb}.
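As a minimal sketch of how an ambiguously tagged text can be produced, the following Python fragment maps each word to its ambiguity class. The toy lexicon, the tag names and the function names are assumptions made for illustration; they are not taken from the system described in this paper.

```python
from collections import defaultdict

# Hypothetical toy lexicon: surface form -> possible PoS tags.
LEXICON = {
    "y": {"CNJ"},
    "la": {"ART", "PRN"},
    "para": {"PR", "VB"},
    "si": {"CNJ"},
}

# Open categories given to unknown words (tag names are illustrative).
OPEN_CATEGORIES = frozenset({"NOUN", "VERB", "ADJ", "ADV", "PROPER_NOUN"})


def ambiguity_class(word):
    """Return the set of possible tags of a word, ignoring context."""
    tags = LEXICON.get(word.lower())
    return frozenset(tags) if tags is not None else OPEN_CATEGORIES


def ambiguously_tag(words):
    """Replace every word by its ambiguity class."""
    return [ambiguity_class(w) for w in words]


def group_by_class(words):
    """Group words that share the same ambiguity class."""
    groups = defaultdict(set)
    for w in words:
        groups[ambiguity_class(w)].add(w)
    return dict(groups)
```

Words missing from the lexicon fall back to the open categories, so they are ambiguous by construction, as described above.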

The paper is organized as follows. Section 2 presents the basics of HMM use 
in disambiguation tasks and discusses existing methods for PoS tagger training; 
section 3 describes our proposal for HMM training and the TL model used in this 
paper; section 4 introduces the translation engine and shows the main results of 
the experiments; finally, in sections 5 and 6 we discuss the results and outline 
future work to be done. 



2 Part-of-Speech Tagging 

When a HMM is used for lexical disambiguation with each input word replaced by
its corresponding ambiguity class, each HMM state is made to correspond to a
different PoS tag and the set of observable items consists of all possible
ambiguity classes [3]. Building a PoS tagger based on HMMs for the SL in a MT
system usually implies:

1. Designing a reduced tagset (set of PoS tags) which groups the finer
tags delivered by the morphological analyser into a smaller set of coarse tags
adequate to the translation task (for example, singular feminine nouns and
plural feminine nouns may be grouped under a single “noun” tag). Additionally,
the number of different lexical probabilities in the HMM is usually
drastically reduced by grouping words in ambiguity classes.

2. Estimating the HMM parameters, that is, finding adequate values of the
parameters of the model. Existing methods may be grouped according to
the kind of corpus they use as input: supervised methods require
unambiguously tagged texts (see the introduction); unsupervised methods are
able to extract information from ambiguously tagged texts, that is, from
sequences of ambiguity classes.

On the one hand, estimating parameters from an unambiguously tagged
corpus is usually the best way to improve performance, but unambiguously
tagged corpora are expensive to obtain and require costly human supervision.
A supervised method counts the number of occurrences in the corpus of
certain tag sequences (usually bigrams) and uses this information to
determine the values of the parameters of the HMM. On the other hand, for the
unsupervised approach no analytical method is known, and existing methods,
such as the Baum-Welch algorithm [9], are only guaranteed to reach
local (not global) maxima of the likelihood.

Some estimation methods, like those using statistics from an unambiguously
tagged corpus, may be used in isolation or as an initialization algorithm for
further reestimation via the Baum-Welch method. In other cases, simple
estimation methods are exclusively considered for initialization purposes.
A good initialization can significantly improve the final performance of a
Baum-Welch-trained PoS tagger, although it will not completely avoid the
risk of convergence to local maxima.

The new method presented in the following section requires only ambigu- 
ously tagged SL texts and raw TL texts (needed to compute a TL model); it 
may, in principle, be used as a complete training method by itself, although 
it may as well be considered for initialization purposes. 



3 Target-Language-Based Training 

This section gives mathematical details on how to train a SL HMM using infor- 
mation from the TL. 

Let $S$ be the whole SL corpus, $s$ be a (possibly ambiguous) segment from $S$,
$g_i$ a sequence of tags resulting from one of the possible disambiguation
choices in $s$, $\tau(g_i, s)$ the translation of $g_i$ in the TL, and
$p_{TL}(\tau(g_i, s))$ the likelihood of $\tau(g_i, s)$ in some TL model. We
will call each $g_i$ a path, since it describes a unique state path in the
HMM, and write $g_i \in T(s)$ to show that $g_i$ is a possible disambiguation
of the words in $s$. Now, the likelihood of path $g_i$ from segment $s$ may be
estimated as:

$$p(g_i \mid s) = \frac{p(g_i \mid \tau(g_i, s))\, p_{TL}(\tau(g_i, s))}{\sum_{g_j \in T(s)} p(g_j \mid \tau(g_j, s))\, p_{TL}(\tau(g_j, s))} \qquad (1)$$

where the term $p(g_i \mid \tau(g_i, s))$ is the conditional probability of
$g_i$ given translation $\tau(g_i, s)$. That is, the likelihood of path $g_i$
in source segment $s$ is made proportional to the TL likelihood of its
translation $\tau(g_i, s)$, but needs to be corrected by a weight
$p(g_i \mid \tau(g_i, s))$, because more than one $g_i$ may contribute to the
same $\tau(g_i, s)$.

The fact that more than one path in segment $s$, say $g_i$ and $g_j$, can
produce the same translation in TL (that is, $\tau(g_i, s) = \tau(g_j, s)$ with
$i \neq j$) does not imply that
$p(g_i \mid \tau(g_i, s)) = p(g_j \mid \tau(g_j, s))$. Indeed, the real
probabilities of paths are in principle unknown (note that their computation
is the main goal of the training method). In the absence of such information,
the contributions of each path will be approximated in this paper to be
equally likely:

$$p(g_i \mid \tau(g_i, s)) = \frac{1}{\mathrm{card}(\{g_j \in T(s) : \tau(g_j, s) = \tau(g_i, s)\})} \qquad (2)$$
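The estimate in eqs. (1) and (2) can be illustrated with a short Python sketch. The function name and the data layout are assumptions made here; the paths and translations are those of the segment in Fig. 1, and the two TL likelihoods are invented values chosen so that the normalized results match the figure.

```python
from collections import Counter


def path_likelihoods(translations, p_tl):
    """Estimate p(g_i | s) for every path of one segment.

    translations: maps each path (tuple of tags) to its TL translation.
    p_tl: maps each TL translation to its likelihood in the TL model.
    """
    # Eq. (2): paths sharing a translation split its likelihood equally.
    sharing = Counter(translations.values())
    weighted = {
        path: p_tl[text] / sharing[text]
        for path, text in translations.items()
    }
    total = sum(weighted.values())
    if total == 0.0:
        raise ValueError("every translation has zero TL likelihood")
    # Eq. (1): normalize over all paths of the segment.
    return {path: w / total for path, w in weighted.items()}


# Segment of Fig. 1; the TL likelihoods below are illustrative only.
translations = {
    ("CNJ", "ART", "PR", "CNJ"): "i la per a si",
    ("CNJ", "ART", "VB", "CNJ"): "i la para si",
    ("CNJ", "PRN", "PR", "CNJ"): "i la per a si",
    ("CNJ", "PRN", "VB", "CNJ"): "i la para si",
}
p_tl = {"i la per a si": 0.0002, "i la para si": 0.9998}
probs = path_likelihoods(translations, p_tl)
```

Only the ratio between TL likelihoods matters: multiplying every value in `p_tl` by the same constant leaves the result unchanged, because eq. (1) normalizes over the paths of the segment.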



Now, we describe how to obtain the parameters of the HMM from the estimated
likelihood of each path in each segment, $p(g_i \mid s)$, which will be treated
as a fractional count. An estimate of tag pair occurrence frequency based on
$p(g_i \mid s)$ is:

$$\tilde{n}(\gamma_i \gamma_j) = \sum_{s \in S} \sum_{g_i \in T(s)} C_{s,g_i}(\gamma_i, \gamma_j)\, p(g_i \mid s) \qquad (3)$$

where $C_{s,g_i}(\gamma_i, \gamma_j)$ is the number of times tag $\gamma_i$ is
followed by tag $\gamma_j$ in path $g_i$ of segment $s$. Therefore, the HMM
parameter $a_{\gamma_i \gamma_j}$ corresponding to the transition probability
from the state associated with tag $\gamma_i$ to the state associated with tag
$\gamma_j$ [9, 3] can be computed as follows:

$$a_{\gamma_i \gamma_j} = \frac{\tilde{n}(\gamma_i \gamma_j)}{\sum_{\gamma_k \in \Gamma} \tilde{n}(\gamma_i \gamma_k)} \qquad (4)$$

where $\Gamma$ is the tagset, that is, the set of all PoS tags.

In order to calculate emission probabilities, the number of times an ambiguity
class is emitted by a given tag is approximated by means of:

$$\tilde{n}(\sigma, \gamma) = \sum_{s \in S} \sum_{g_i \in T(s)} C_{s,g_i}(\sigma, \gamma)\, p(g_i \mid s) \qquad (5)$$

where $C_{s,g_i}(\sigma, \gamma)$ is the number of times ambiguity class
$\sigma$ is emitted by tag $\gamma$ in path $g_i$ of segment $s$. Therefore,
the HMM parameter $b_{\gamma_i \sigma}$ corresponding to the emission
probability of ambiguity class $\sigma$ from the state associated with
$\gamma_i$ is computed as:

$$b_{\gamma_i \sigma} = \frac{\tilde{n}(\sigma, \gamma_i)}{\sum_{\sigma' : \gamma_i \in \sigma'} \tilde{n}(\sigma', \gamma_i)} \qquad (6)$$

Source language: s = y la para si, with ambiguity classes
{CNJ} {ART, PRN} {PR, VB} {CNJ}

  Path                        Translation (target language)    p(g_i|s)
  g_1 = CNJ ART PR CNJ        τ(g_1, s) = i la per a si        0.0001
  g_2 = CNJ ART VB CNJ        τ(g_2, s) = i la para si         0.4999
  g_3 = CNJ PRN PR CNJ        τ(g_3, s) = i la per a si        0.0001
  g_4 = CNJ PRN VB CNJ        τ(g_4, s) = i la para si         0.4999

Fig. 1. Example of an ambiguous SL (Spanish) text segment, paths and translations
(into Catalan) resulting from each possible disambiguation, and normalized estimated
likelihood for each path translation. The second source-language word (la) is a free
ride, as can be observed in the corresponding translation into target language

Notice that when training from unambiguously tagged texts the expressions
used to compute transition and emission probabilities are analogous to the
previous equations, but in this case $p(g_i \mid s) = 1$ in (3) and (5), as only
one path is possible in a tagged corpus segment; therefore, (3) and (5) would
no longer be approximate, but exact. Figure 1 shows an example of the
application of the method.
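The step from path likelihoods to HMM parameters, eqs. (3) to (6), can be sketched in Python as follows. This is an illustration under assumptions: the function names and data layout are invented here, and the only input is the single segment of Fig. 1 with the path likelihoods shown there.

```python
from collections import defaultdict


def accumulate(segments):
    """Fractional transition and emission counts, eqs. (3) and (5).

    segments: iterable of (classes, paths); classes lists the ambiguity
    class of each word, paths maps each tag sequence to p(g_i | s).
    """
    trans = defaultdict(float)
    emit = defaultdict(float)
    for classes, paths in segments:
        for tags, p in paths.items():
            for prev, nxt in zip(tags, tags[1:]):
                trans[(prev, nxt)] += p
            for sigma, tag in zip(classes, tags):
                emit[(sigma, tag)] += p
    return trans, emit


def normalize(trans, emit):
    """Turn fractional counts into probabilities, eqs. (4) and (6)."""
    out_total = defaultdict(float)
    for (prev, _), n in trans.items():
        out_total[prev] += n
    tag_total = defaultdict(float)
    for (_, tag), n in emit.items():
        tag_total[tag] += n
    a = {pair: n / out_total[pair[0]] for pair, n in trans.items()}
    b = {pair: n / tag_total[pair[1]] for pair, n in emit.items()}
    return a, b


# The segment of Fig. 1 with the path likelihoods given there.
classes = [frozenset({"CNJ"}), frozenset({"ART", "PRN"}),
           frozenset({"PR", "VB"}), frozenset({"CNJ"})]
paths = {
    ("CNJ", "ART", "PR", "CNJ"): 0.0001,
    ("CNJ", "ART", "VB", "CNJ"): 0.4999,
    ("CNJ", "PRN", "PR", "CNJ"): 0.0001,
    ("CNJ", "PRN", "VB", "CNJ"): 0.4999,
}
a, b = normalize(*accumulate([(classes, paths)]))
```

With this single segment the transitions from CNJ to ART and from CNJ to PRN receive the same weight: the free ride *la* gives the TL model nothing with which to separate them. With an unambiguously tagged corpus every segment has a single path with likelihood 1, and the same functions reduce to ordinary counting.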

SL text segmentation must be carefully designed so that two words which
get joint treatment in some stage of processing of the MT system are not
assigned to different segments. This would result in incorrect sequences in
the TL (for example, if two words involved in a word reordering rule are
assigned to different segments) and, as a consequence, in wrong likelihood
estimations. In general, HMMs can be trained by breaking the corpus into
segments whose first and last words are unambiguous, since unambiguous words
reveal the hidden state of the HMM [3, sect. 3.4]. Adequate strategies for
ensuring segment independence depend on the particular translation system (we
will describe in section 4 the strategy used in our experiments).

3.1 Target-Language Model

A classical trigram model of TL surface forms (SF, lexical units as they appear 
in original texts) is considered for this paper, although it may be worth studying 
some other language models. The trigram model is obtained in an unsupervised 
manner from a 1 822 067-word TL corpus taken from Catalan newspapers. 

To prevent unseen trigrams from giving zero probability to every text
segment containing them, probabilities are smoothed via a form of deleted
interpolation [6]. The smoothed trigram probabilities consist of a linear
combination of trigram, bigram and unigram probabilities, and the successive
linear abstraction approximation [1] is used to compute the corresponding
coefficients. Since in the case of unseen words the resulting smoothed trigram
probability is still zero, unigram probabilities are smoothed as well using
the Good-Turing method [4].
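The linear combination described above can be sketched in Python. Two simplifications are assumed and should not be read as the method of this paper: the interpolation weights are fixed by hand instead of being computed by successive linear abstraction, and add-one smoothing stands in for Good-Turing smoothing of the unigrams. All function names are illustrative.

```python
from collections import Counter


def train_counts(tokens):
    """Collect unigram, bigram and trigram counts from a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def add_one_unigram(uni):
    """Add-one smoothed unigram model (a stand-in for Good-Turing)."""
    total = sum(uni.values())
    slots = len(uni) + 1  # one extra slot shared by all unseen words
    return lambda w: (uni[w] + 1) / (total + slots)


def interpolated(w1, w2, w3, counts, lambdas, p_unigram):
    """P(w3 | w1 w2) as a weighted sum of trigram, bigram and unigram."""
    uni, bi, tri = counts
    l3, l2, l1 = lambdas  # assumed to be non-negative and to sum to 1
    p3 = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    p2 = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    return l3 * p3 + l2 * p2 + l1 * p_unigram(w3)


counts = train_counts("el gat i el gos".split())
p1 = add_one_unigram(counts[0])
seen = interpolated("i", "el", "gos", counts, (0.6, 0.3, 0.1), p1)
unseen = interpolated("i", "el", "xyz", counts, (0.6, 0.3, 0.1), p1)
```

Because the unigram term is never zero, a word absent from the training text still receives a small positive probability, which is the reason given above for smoothing the unigrams as well.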

When evaluating path likelihoods, if text segmentation is correctly performed
so that segments are independent (as already mentioned), a good estimate of
trigram probabilities for a given path can be obtained by considering all
possible translations of the two words preceding the current segment and the
first two words of the following one. This local approach can be safely used
because a complete trigram likelihood evaluation for the whole corpus would
multiply the likelihood by the same factor¹ for every possible path of the
segment and, therefore, it would not affect the normalized likelihood
estimated for each path of a segment in eq. (1).

Finally, notice that computing likelihoods as trigram probability products
causes (as in most statistical MT approaches) shorter translations to receive
higher scores than longer ones.

Fig. 2. Main modules of the transfer machine translation system (see section 4.1) used
in the experiments



4 Experiments 

4.1 Machine Translation Engine 

Since our training algorithm assumes the existence of a MT system (most likely, 
the system in which the resulting PoS tagger will be embedded) in order to 
produce texts from which statistics about TL will be collected, we briefly intro- 
duce the system used in the experiments, although almost any other architecture 
(with a HMM-based PoS tagger) may also be suitable for the algorithm. 

We used the publicly accessible Spanish-Catalan (two related languages) MT 
system interNOSTRUM [2], which basically follows the (morphological) transfer 
architecture shown in figure 2: 

— A morphological analyser tokenizes the text in surface forms (SF) and
delivers, for each SF, one or more lexical forms (LF) consisting of lemma,
lexical category, and morphological inflection information.

— A PoS tagger (categorial disambiguator) chooses, using a hidden Markov
model (HMM), one of the LFs corresponding to an ambiguous SF.

— A lexical transfer module reads each SL LF and delivers the corresponding
TL LF.

— A structural transfer module (parallel to the lexical transfer) uses a
finite-state chunker to detect patterns of LFs which need to be processed for
word reordering, agreement, etc.

— A morphological generator delivers a TL SF for each TL LF, by suitably
inflecting it.

— A postgenerator performs some orthographical operations such as
contractions.

¹ This factor (see section 3.1) results from the contribution of trigrams which
do not include words in the current segment.

4.2 Text Segmentation 

An adequate strategy for SL text segmentation is necessary. Besides the general 
rules mentioned in section 3, in our setup it must be ensured that all words in 
every pattern transformed by the structural transfer belong to the same segment. 

The strategy followed in this paper is segmenting at nonambiguous words
whose PoS tag is not present in any structural transfer rule or at nonambiguous
words appearing in rules not applicable in the current context. In addition,
one exception is made: no segmentation is performed at words which start a
multiword unit that could be processed by the postgenerator (for example, a
word followed by another with which it usually contracts into a single
Catalan word). Unknown words are also treated as segmentation points, since
the lexical transfer has no bilingual information for them and no rule is
activated at all.
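A much simplified version of this segmentation strategy can be sketched in Python. The sketch only implements the first criterion (cutting at nonambiguous words whose tag is not mentioned by any structural transfer rule); rule applicability in context, the postgenerator exception and unknown words are left out. The function name and the set `rule_tags` are assumptions made for illustration.

```python
def segment(classes, rule_tags):
    """Split a sequence of ambiguity classes into segments.

    A segment is closed after every unambiguous word whose single tag
    does not appear in rule_tags, the (hypothetical) set of tags that
    are mentioned by some structural transfer rule.
    """
    segments, current = [], []
    for cls in classes:
        current.append(cls)
        if len(cls) == 1 and not (cls & rule_tags):
            segments.append(current)
            current = []
    if current:  # trailing words with no closing break point
        segments.append(current)
    return segments


text = [frozenset({"CNJ"}), frozenset({"ART", "PRN"}),
        frozenset({"PR", "VB"}), frozenset({"CNJ"}),
        frozenset({"NOUN"}), frozenset({"ADV"})]
parts = segment(text, rule_tags=frozenset({"NOUN", "ART"}))
```

In this toy input the unambiguous NOUN does not close a segment because its tag is assumed to take part in a transfer rule, so it stays together with the word that follows it.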

4.3 Results 

We study both PoS tagging performance and translation performance after train- 
ing the PoS tagger for Spanish. The Spanish corpus is divided into segments as 
described in 4.2. For each segment, all possible translations into TL (Catalan) 
according to every possible combination of disambiguations are considered. The 
likelihoods of these translations are computed through a Catalan trigram model 
and then normalized and transferred to the transition matrix and emission ma- 
trix of the HMM as described in section 3. The whole process is unsupervised: 
no unambiguously tagged text is needed. 

The tagset used by the Spanish PoS disambiguator consists of 82 coarse tags
(69 single-word and 13 multi-word tags for contractions, verbs with clitic
pronouns, etc.) grouping the 1 917 fine tags (386 single-word and 1 531
multi-word tags) generated by the morphological analyser. The number of
observed ambiguity classes is 249. In addition, a few words (ambiguous between
preposition and verb; conjunction and relative; preposition, relative and
verb; and adverb and adjective, respectively) are assigned special hidden
states, and consequently special ambiguity classes, in a similar manner to
that described in [8].

For comparison purposes, a HMM-based PoS tagger was trained from ambiguously
tagged SL corpora following a classical approach, that is, initializing
the parameters of the HMM by means of Kupiec’s method [7] and using the
Baum-Welch algorithm to reestimate the model; a 1 000 000-word ambiguous
corpus was used for training. The resulting PoS tagger was tested after each
iteration, and the one giving an error rate which did not improve in the
subsequent 3 iterations was chosen for evaluation; proceeding in this way, we
prevent the algorithm from stopping if a better PoS tagger can still be
obtained. Moreover, another HMM was trained from an unambiguously tagged
20 000-word SL corpus and used as a reference of the best attainable results.

Fig. 3. PoS tagging error rate (top) and translation error rate (bottom);
horizontal axis: thousands of words. The Baum-Welch error rate after training
with a 1 000 000-word corpus is given as well. PoS tagging error rate is
expressed as the percentage of incorrect tags assigned to ambiguous words
(including unknown words). Translation errors are expressed as the percentage
of words that need to be post-edited due to mistaggings

A set of disjoint SL corpora with 200 000 words each was considered for
evaluating the proposed method, and the resulting performance was recorded at
every 1 000 words. Figure 3 shows the evolution of the PoS tagging error rate
and the translation error rate for one representative corpus (the rest of the
corpora behave in a similar way); Baum-Welch results are reported there as
well. PoS tagging errors in figure 3 are expressed as the percentage of
incorrect tags assigned to ambiguous words (including unknown words), not as
the overall percentage of correct tags (over ambiguous and nonambiguous
words); translation errors, however, do not consider unknown words and are
expressed as the percentage of words that need to be corrected or inserted
because of wrongly tagged words when post-editing the translation (in some
cases, a wrongly tagged word implies correcting more than one TL word because
of single words translating into multiword units or because of actions of the
structural transfer or the postgenerator which would not have been performed
if the word had been correctly tagged, or vice versa).

The PoS tagging error is evaluated using an independent 8 031-word
unambiguously tagged Spanish corpus. The percentage of ambiguous words
(according to the lexicon) is 26.71% and the percentage of unknown words is
1.95%. For translation evaluation, an 8 035-word Spanish corpus and the
corresponding human-corrected machine translation into Catalan are used.

The tagging error rate obtained with the PoS tagger trained from an
unambiguously tagged corpus (obtained in a supervised manner) is 10.35% and
the translation error rate is 2.60%; these figures can be used as a reference
for the best possible results.

As can be seen in figure 3, sudden oscillations which change the PoS tagging
error by around 10% occur in just one step. This behaviour is due to free
rides (very common in the case of related languages like Spanish and Catalan):
since free rides give the same translation regardless of disambiguation
choices, the TL trigram model cannot be used to distinguish among them and,
consequently, paths involving free rides receive the same weight when
estimating the parameters of the PoS tagger.

The most common free rides in Spanish-Catalan translation are three Spanish
words that belong to the same ambiguity class, formed by article and proclitic
pronoun. In the evaluation corpus these three words are 22.98% of the
ambiguous words. It may be argued that free rides should not affect the
translation error rate; however, they do. This is because, depending on the
tag choice, the structural transfer performs changes (gender agreement, for
example) or the postgenerator performs contractions; in these cases, these
words are not free rides, but the number of times they occur is not enough for
the TL trigram model to capture their influence and make the system more
stable. In the next subsection, a way of addressing this undesirable effect of
free rides is explored.

4.4 Reducing the Impact of Free Rides

The problem of free rides may be partially solved if linguistic information is
used to forbid some impossible tag bigrams, so that paths containing forbidden
bigrams are ignored. The previous experiments were repeated with this new
approach, and the results are discussed here.

A HMM-based PoS tagger trained via the Baum-Welch algorithm is used
again for comparison purposes, but using the information from forbidden
bigrams as follows: forbidden bigrams are transferred into the HMM parameters
by introducing zero values in the transition matrix after Kupiec’s initialization
but before training. The Baum-Welch algorithm naturally preserves these zeroes
during the reestimation process.

Fig. 4. PoS tagging error rate (top) and translation error rate (bottom) when rules
to forbid impossible bigrams are considered (compare with figure 3). The Baum-Welch
error rate after training with a 1 000 000-word corpus is given as well. PoS tagging
error rate is expressed as the percentage of incorrect tags assigned to ambiguous words
(including unknown words). Translation errors are expressed as the percentage of words
that need to be corrected (post-edited) due to mistaggings
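Both uses of forbidden bigrams described in this subsection can be sketched in Python: discarding paths in the proposed method, and zeroing transition probabilities before Baum-Welch training. The function names and the tag pairs used in the example are invented for illustration, and the row renormalization after zeroing is an assumption of this sketch; the text only states that zeros are introduced. Baum-Welch keeps such zeros because the expected count of a transition is proportional to its current probability.

```python
def is_allowed(path, forbidden):
    """True if no two consecutive tags of the path form a forbidden bigram."""
    return all(pair not in forbidden for pair in zip(path, path[1:]))


def apply_forbidden(transitions, forbidden):
    """Zero the forbidden tag bigrams and renormalize every row.

    transitions: {prev_tag: {next_tag: probability}}.
    forbidden: set of (prev_tag, next_tag) pairs.
    """
    result = {}
    for prev, row in transitions.items():
        kept = {nxt: (0.0 if (prev, nxt) in forbidden else p)
                for nxt, p in row.items()}
        total = sum(kept.values())
        if total == 0.0:
            raise ValueError(f"all transitions from {prev!r} are forbidden")
        result[prev] = {nxt: p / total for nxt, p in kept.items()}
    return result


# Illustrative tag pairs only; they are not the list used in the paper.
forbidden = {("ART", "VB")}
initial = {"ART": {"NOUN": 0.5, "VB": 0.25, "PRN": 0.25}}
constrained = apply_forbidden(initial, forbidden)
```

Filtering paths with `is_allowed` before translating them also reduces the number of translations to compute, which is consistent with the shorter training time reported for the variant with forbidden bigrams.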

The number of forbidden bigrams (independently collected by a linguist) is 
218; the more statistically important are before , 

followed by words which are not or 

, and before 

As can be seen in figure 4 (compare with figure 3, where the same corpus is
considered), sudden oscillations decrease and the PoS tagging error rate is
significantly lower than that obtained without forbidden bigrams. Smaller
sudden oscillations still happen due to other free rides which are not
covered by the set of forbidden bigrams but have a secondary presence in
Spanish corpora.

Concerning execution time, the new method needs more training time than
the Baum-Welch algorithm because of the enormous number of translations and
path likelihoods that need to be explicitly considered (remember, however, that
the time necessary for processing ambiguous texts after training is independent of
the algorithm being used for estimating the parameters of the HMM). With the
example corpus the original algorithm takes up to 44 hours on a typical desktop
computer, and around 16 hours when forbidden bigrams are introduced. The
number of paths $n_p$ and, consequently, the number of segment translations grow
exponentially with segment length $l$ and can be approximated by
$n_p \approx 1.46^l$.
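To give a feeling for this growth, the following Python sketch compares the exact number of paths of a segment (the product of the sizes of its ambiguity classes) with the approximation reported above. The function names are illustrative; the example segment is the one of Fig. 1.

```python
from math import prod


def count_paths(classes):
    """Exact number of paths: product of the ambiguity class sizes."""
    return prod(len(c) for c in classes)


def approx_paths(length, base=1.46):
    """Approximate number of paths for a segment of the given length."""
    return base ** length


# Segment of Fig. 1: classes of sizes 1, 2, 2 and 1.
fig1 = [{"CNJ"}, {"ART", "PRN"}, {"PR", "VB"}, {"CNJ"}]
exact = count_paths(fig1)
estimate = approx_paths(len(fig1))

# Growth of the approximation for a few segment lengths.
growth = {l: approx_paths(l) for l in (5, 10, 20, 30)}
```

Since every path must be translated and scored, this exponential growth is what makes long segments expensive and motivates pruning paths with forbidden bigrams.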



5 Discussion 

It has been shown that training HMM-based PoS taggers using unsupervised
information from TL texts is relatively easy. Moreover, both tagging and
translation errors lie between those produced by classical unsupervised models
using Baum-Welch estimation and those attained with a supervised solution
based on nonambiguous texts.

The presence of free rides, very common when translation involves closely
related languages like Spanish and Catalan (both coming from Latin and more
related than other Romance languages), makes the algorithm behave unstably
due to the kind of TL model used in the experiments (surface form trigrams).
This problem may be partially overcome by using a small amount of linguistic
information (forbidden bigrams) which can be obtained in most cases much more
easily than the hand-tagged corpora needed for supervised training.

Since our goal is to produce PoS taggers to be embedded in MT systems, we
can focus on translation error and conclude that, despite the fact that the
tagging error rate starts with higher values than the one obtained with a
Baum-Welch-trained tagger, the final translation error is around 2% smaller.
Our method significantly reduces the tagging error when compared with other
training methods using ambiguously tagged SL texts.

The training method described in this paper produces a PoS tagger which is 
in tune not only with SL but also with the TL of the translation engine. This 
makes it suitable for training PoS taggers to be embedded in MT systems. The 
method may also be used to customize the MT engine to a particular text type 
or subject or to statistically “retune” it after introducing new transfer rules in 
the MT system. 



6 Future Work 

We plan to study other TL models. One of the most interesting alternatives
in the case of a bidirectional MT system is to consider another HMM as TL
model. In this way, the two HMMs used as PoS taggers in a bidirectional MT
system could be trained simultaneously from scratch by initializing one of them
(with Kupiec’s method, for example) and using it as TL model to estimate
the parameters of the other one through our training algorithm. After that,
roles could be interchanged so that the last HMM being trained is now used as
TL model, and so on until convergence. The process may be seen as a kind of
bootstrapping: the PoS tagger for one of the languages is initialized in a simple
way and both HMMs alternate between serving as TL model and being the model
under adjustment.

A different line of research will study the improvement of the estimation in 
eq. (2). A better estimation for p{gi\T{gi, s)) might reduce the impact of free rides 
without considering linguistic information. One possible approach is to query the 
model currently being estimated about this probability. 

Finally, another line of work will focus on time complexity reduction. On
the one hand, we propose to introduce new forbidden tag bigrams to reduce
even further the number of translations to be computed. On the other hand, other
strategies may prove valuable, for example using the model being estimated
to calculate approximate likelihoods which make it possible to consider only the
k best paths for translation.



Abstract. We define Probabilistic Constrained W-grammars (PCW-grammars),
a two-level formalism capable of capturing grammatical frameworks used in
two state of the art parsers, namely bilexical grammars and stochastic tree
substitution grammars. We provide embeddings of these parser formalisms
into PCW-grammars, which allows us to derive properties about their
expressive power and consistency, and relations between the formalisms
studied.



1 Introduction 

State of the art statistical natural language parsers, e.g., [1,3,4] are procedures 
for extracting the syntactic structure hidden in natural language sentences. Usu- 
ally, statistical parsers have two clearly identifiable main components. One has 
to do with the nature of the set of syntactic analyses that the parser can provide. 
It is usually defined using a grammatical framework, such as probabilistic con- 
text free grammars (PCFGs), bilexical grammars, etc. The second component 
concerns the way in which the different parts in the grammatical formalism are 
learned. For example, PCFGs can be read from tree-banks and their probabilities
estimated using maximum likelihood [11].

Clearly, the grammatical framework underlying a parser is a key component
in the overall definition of the parser which determines important characteris- 
tics of the parser, either directly or indirectly. Among others, the grammatical 
framework defines the set of languages the parser can potentially deal with, a 
lower bound on the parser’s complexity, and the type of items that should be 
learned by the second component mentioned above. Hence, a thorough under- 
standing of the grammatical framework on which a parser is based provides a 
great deal of information about the parser itself. We are particularly interested
in the following properties: (1) The expressive power of a grammar formalism.
(2) Conditions under which the probability distribution defined over the set
of possible syntactic analyses is consistent: if this is the case, the probabilities
associated with an analysis can be used as meaningful probabilistic indicators
both for further stages of processing [11] and for evaluation [7]. (3) The relation
to other grammatical frameworks; this provides insights about the assumptions
made by the various frameworks.

Since building a parser is a time consuming process, formal properties of the
underlying grammatical framework are not always a priority. Also, comparisons
between parser models are usually based on experimental evidence. In order to 
establish formal properties of parsers and to facilitate the comparison of parsers, 
we believe that a unifying grammatical framework, of which different parsers’ 
grammars can be obtained as instances, is instrumental. Our main contribu- 
tion in this paper is the introduction of a grammatical framework capable of 
capturing state of the art grammatical formalisms. Our framework is based on 
so-called W-grammars, due originally to Van Wijngaarden [13]. We constrain W- 
grammars to obtain CW-grammars, which are more suitable for statistical natu- 
ral language parsing than W-grammars. PCW-grammars extend CW-grammars 
with probabilities. In this paper we provide embeddings of bilexical grammars 
[5] and stochastic tree substitution grammars [1] into PCW-grammars, and we 
use these embeddings to derive results on expressive power, consistency, and re- 
lations with other grammatical formalisms. Due to lack of space, embeddings of 
further grammatical formalisms have had to be omitted. 

In Section 2 we present our grammatical framework and establish results on 
expressive power and conditions for inducing consistent distributions. In Sec- 
tion 3 we capture the models mentioned above in our framework, and derive 
consequences of the embeddings. In Section 4 we conclude. 

2 Grammatical Framework 

In this section we describe the grammatical framework we will be working with. 
We introduce constrained W-grammars, then present a probabilistic version, and 
also introduce technical notions needed in later sections. 

2.1 Constrained W-Grammars

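A constrained W-grammar (CW-grammar) is a 6-tuple
$(V, NT, T, S, \stackrel{m}{\longrightarrow}, \stackrel{s}{\longrightarrow})$
such that:

— $V$ is a set of symbols called variables. Elements in $V$ are denoted with
calligraphic characters, e.g., $\mathcal{A}, \mathcal{B}, \mathcal{C}$.

— $NT$ is a set of symbols called non-terminals; elements in $NT$ are denoted
with upper-case letters, e.g., $X, Y, Z$.

— $T$ is a set of symbols called terminals, denoted with lower-case letters,
e.g., $a, b, c$, such that $V$, $T$ and $NT$ are pairwise disjoint.

— $S$ is an element of $NT$ called the starting symbol.

— $\stackrel{m}{\longrightarrow}$ is a finite binary relation defined on
$(V \cup NT \cup T)^*$ such that if $x \stackrel{m}{\longrightarrow} y$, then
$x \in V$. The elements of $\stackrel{m}{\longrightarrow}$ are called
meta-rules.

— $\stackrel{s}{\longrightarrow}$ is a finite binary relation on
$(V \cup NT \cup T)^*$ such that if $r \stackrel{s}{\longrightarrow} s$ then
$r \in NT$, $s \neq \epsilon$ and $s$ does not have any variable appearing more
than once. The elements of $\stackrel{s}{\longrightarrow}$ are called
pseudo-rules.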

CW-grammars differ from Van Wijngaarden’s original W-grammars in that
pseudo-rules have been constrained. The original W-grammars allow pseudo-rules
to have variables on the left-hand side as well as repeated variables on
both the right- and left-hand side. The constrained version defined above yields
a dramatic reduction in the expressive power of W-grammars, but, as we will
see below, at the same time it allows us to capture state of the art parsers.

CW-grammars are rewriting devices, and as such they consist of rewriting
rules. They differ from the usual rewriting systems in that the rewriting rules do
not exist a priori. Using pseudo-rules and meta-rules one builds ‘real’ rules that
can be used in the rewriting process. The rewriting rules produced are denoted
by $\stackrel{w}{\Longrightarrow}$. These rules are built by first selecting a
pseudo-rule, and then using meta-rules for instantiating all the variables the
pseudo-rule might contain.

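For example, let
$W = (V, NT, T, S, \stackrel{m}{\longrightarrow}, \stackrel{s}{\longrightarrow})$
be a CW-grammar where $V = \{\mathcal{ADJ}\}$, $NT = \{S, Adj, Noun\}$,
$T = \{ball, big, green, red, \ldots\}$, while $\stackrel{m}{\longrightarrow}$
and $\stackrel{s}{\longrightarrow}$ are given by the following table:

  meta-rules                                                      pseudo-rules
  $\mathcal{ADJ} \stackrel{m}{\longrightarrow} \mathcal{ADJ}\ Adj$     $S \stackrel{s}{\longrightarrow} \mathcal{ADJ}\ Noun$
  $\mathcal{ADJ} \stackrel{m}{\longrightarrow} Adj$                    $Adj \stackrel{s}{\longrightarrow} big$
                                                                  $Noun \stackrel{s}{\longrightarrow} ball$
                                                                  ...

Suppose now that we want to build the rule
$S \stackrel{w}{\Longrightarrow} Adj\ Adj\ Adj\ Noun$. We take the pseudo-rule
$S \stackrel{s}{\longrightarrow} \mathcal{ADJ}\ Noun$ and instantiate the
variable $\mathcal{ADJ}$ with $Adj\ Adj\ Adj$ to get the desired rule. The rules
defined by $W$ have the following shape:
$S \stackrel{w}{\Longrightarrow} Adj^{*}\ Noun$. Trees for this grammar are
flat, with a main node $S$ and all the adjectives in it as daughters; see
Figure 1.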
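The way rules are built from a pseudo-rule and meta-rules can be sketched in Python. The encoding is an assumption made here: rules are pairs of a left-hand symbol and a tuple of right-hand symbols, the variable is written `"ADJ"` and the non-terminal `"Adj"`, and the grammar is the adjective example as read from this section. Because meta-rules can be applied without bound, the sketch only enumerates instantiations up to a maximum length, and it assumes that no meta-rule has an empty right-hand side.

```python
from itertools import product


def expansions(variable, meta_rules, variables, max_len):
    """Variable-free strings derivable from `variable` via meta-rules.

    Only strings of at most max_len symbols are returned; pruning by
    length is sound because no meta-rule shortens a string.
    """
    results, seen, frontier = set(), set(), [(variable,)]
    while frontier:
        form = frontier.pop()
        if form in seen or len(form) > max_len:
            continue
        seen.add(form)
        idx = next((i for i, x in enumerate(form) if x in variables), None)
        if idx is None:
            results.add(form)
            continue
        for lhs, rhs in meta_rules:
            if lhs == form[idx]:
                frontier.append(form[:idx] + rhs + form[idx + 1:])
    return results


def instantiate(pseudo_rule, meta_rules, variables, max_len):
    """Build rewriting rules by instantiating each variable occurrence."""
    lhs, rhs = pseudo_rule
    options = [
        expansions(x, meta_rules, variables, max_len)
        if x in variables else {(x,)}
        for x in rhs
    ]
    return {(lhs, sum(choice, ())) for choice in product(*options)}


variables = {"ADJ"}
meta_rules = [("ADJ", ("ADJ", "Adj")), ("ADJ", ("Adj",))]
rules = instantiate(("S", ("ADJ", "Noun")), meta_rules, variables, 3)
```

Each variable occurrence can be instantiated independently because, by definition, no variable appears more than once in the right-hand side of a pseudo-rule.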

The string language $L(W)$ generated by a CW-grammar $W$ is the set
$\{\beta \in T^{+} : S \stackrel{w}{\Longrightarrow}^{+} \beta\}$. In words, a
string $\beta$ belongs to the language $L(W)$ if there is a way to instantiate
rules $\stackrel{w}{\Longrightarrow}$ that derive $\beta$ from $S$. A W-tree
yielding a string $l$ is defined as the $\stackrel{w}{\Longrightarrow}$
derivation producing $l$. A W-tree ‘pictures’ the rules (i.e., pseudo-rules +
variable instantiations) that have been used for deriving a string;
Figure 1(a) shows an example. The way in which a rule has been obtained from
pseudo-rules or the way in which its variables have been instantiated remains
hidden. The tree language generated by a CW-grammar $W$ is the set $T(W)$
defined by all W-trees generated by $W$ yielding a string in $L(W)$.

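Theorem 1. CW-grammars are weakly equivalent to context-free grammars.

Proof. Let $W = (V, NT, T, S, \xrightarrow{m}, \xrightarrow{s})$ be a CW-grammar. Let
$G_W = (NT', T', S', R')$ be a context-free grammar defined as follows (to avoid
confusion we denote the rules in $R'$ by $\to$): $NT' = V \cup NT$; $T' = T$; $S'$ is
the starting symbol in $W$; and $A \to \alpha \in R'$ iff $A \xrightarrow{m} \alpha$ or
$A \xrightarrow{s} \alpha$. It can be shown that $G_W$ is well-defined and generates
the same language as $W$.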
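The construction in the proof is mechanical, as the following minimal sketch suggests. The representation of rules as `(head, body)` pairs and the function name `underlying_cfg` are assumptions made for illustration only.

```python
def underlying_cfg(variables, nonterminals, terminals, start,
                   meta_rules, pseudo_rules):
    """Build the context-free grammar (NT', T', S', R') from a CW-grammar:
    NT' = V ∪ NT, T' = T, S' = S, and R' = meta-rules ∪ pseudo-rules.
    Each rule is a pair (head, body) with body a tuple of symbols."""
    cfg_nonterminals = set(variables) | set(nonterminals)
    cfg_rules = set(meta_rules) | set(pseudo_rules)
    # Sanity check: every rule must rewrite a symbol of NT'.
    for head, _body in cfg_rules:
        if head not in cfg_nonterminals:
            raise ValueError(f"rule head {head!r} is not in V ∪ NT")
    return cfg_nonterminals, set(terminals), start, cfg_rules


# Illustrative input: a small adjective grammar (symbols are assumptions).
meta = {("ADJ", ("ADJ", "Adj")), ("ADJ", ("Adj",))}
pseudo = {("S", ("ADJ", "Noun")), ("Adj", ("red",)), ("Noun", ("ball",))}
nts, ts, s, rules = underlying_cfg({"ADJ"}, {"S", "Adj", "Noun"},
                                   {"red", "ball"}, "S", meta, pseudo)
print(sorted(nts), len(rules))
```

In the resulting grammar the variable ADJ is an ordinary non-terminal, so the meta-derivations that a W-tree hides become visible levels of the CFG tree.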

Given a CW-grammar $W$, the context-free grammar underlying $W$, notation
$CFG(W)$, is the grammar $G_W$ defined in the proof of Theorem 1. In Figure 1
we show a derivation in $W$ and the corresponding one in $CFG(W)$.
[Figure: (a) S with daughters Adj Adj Adj Noun over the words red big green ball;
(b) the same tree with the meta-derivation below the variable ADJ shown;
(c) S with daughters A A A B B over a a a b b.]

Fig. 1. (a) A tree generated by W. (b) The same tree with meta-rule derivations made
visible. (c) A derivation tree for the string “aaabb”



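Lemma 1. Let $W$ be a CW-grammar and let $G = CFG(W)$. For every tree
$\tau \in T(G)$ there is a unique tree $v \in T(W)$ obtained from $\tau$ by hiding
all its meta-derivations.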



Proof. We sketch the proof using Figure 1. Suppose we want to derive the W-tree
corresponding to the tree in Figure 1(a) from the one in Figure 1(b). Besides
the G-tree we need to know which rules in G are meta-rules in W and which
non-terminals in G are variables in W. To obtain a W-tree from a G-tree we
replace all variables in CFG-rules (corresponding to pseudo-rules) by the yield
of the CFG-derivation (corresponding to a meta-derivation). To illustrate this
idea, consider the yield below the variable ADJ in Figure 1(b): we ‘hide’ the
meta-derivation producing it, thus obtaining the tree in Figure 1(a), the W-tree.
Since such a replacement procedure is uniquely defined, for every tree in T(G)
there is a unique way to hide meta-derivations; consequently, for every G-tree
there is a unique W-tree, as desired.



Next, we give an example to show that CW-grammars are not strongly
equivalent to context-free grammars. In other words, trees generated by
CW-grammars are different from trees generated by context-free grammars.



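Example 1. Let $W = (V, NT, T, S, \xrightarrow{m}, \xrightarrow{s})$ be a CW-grammar
with $V = \{\mathcal{A}, \mathcal{B}\}$, $NT = \{S, A, B\}$, $T = \{a, b\}$,
$\xrightarrow{m} = \{\mathcal{A} \to \mathcal{A}A,\ \mathcal{A} \to A,\ \mathcal{B} \to \mathcal{B}B,\ \mathcal{B} \to B\}$,
and $\xrightarrow{s} = \{A \to a,\ B \to b,\ S \to \mathcal{A}\mathcal{B}\}$.

The grammar $W$ generates the language $\{a^*b^*\}$ through instantiations of the
variables $\mathcal{A}$ and $\mathcal{B}$ to strings in $A^*$ and $B^*$ respectively.
The $\stackrel{w}{\Rightarrow}$ derivation for the string aaabb is as follows:
$S \stackrel{w}{\Rightarrow} AAABB \stackrel{w}{\Rightarrow}^* aaabb$. The tree
representing this derivation (Figure 1(c)) has only one internal level (labeled
AAABB), and its leaves form the accepted string. Observe that no CFG can generate
the kind of flat structures displayed in Figure 1(c) since any context-free
grammar producing the same language as W will have more than one intermediate
level in its derivation trees.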
2.2 Probabilistic CW-Grammars

Probabilistic CW-grammars (PCW-grammars, for short) are CW-grammars in
which the rules are augmented with a probability value, such that the
probabilities belonging to rules sharing the same left-hand side sum up to one.
More formally, in a probabilistic CW-grammar
$(V, NT, T, S, \xrightarrow{m}, \xrightarrow{s})$ we have that

— $\sum_{x \xrightarrow{m}_p y} p = 1$ for all meta-rules $x \xrightarrow{m}_p y$ having $x$ in the left-hand side.

— $\sum_{x \xrightarrow{s}_p y} p = 1$ for all pseudo-rules $x \xrightarrow{s}_p y$ having $x$ in the left-hand side.

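Next, we need to define how we assign probabilities to derivations, rules, and
trees. To start with derivations, if $\alpha' \xrightarrow{m}^* \alpha$ then there are
$\alpha_1, \ldots, \alpha_k$ such that $\alpha_i \xrightarrow{m} \alpha_{i+1}$,
$\alpha_1 = \alpha'$ and $\alpha_k = \alpha$. We define the probability
$P(\alpha' \xrightarrow{m}^* \alpha)$ of a derivation $\alpha' \xrightarrow{m}^* \alpha$
to be $\prod_{i=1}^{k-1} P(\alpha_i \xrightarrow{m} \alpha_{i+1})$.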

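Now, let $X \stackrel{w}{\Rightarrow} \alpha$ be a rule. The probability
$P(X \stackrel{w}{\Rightarrow} \alpha)$ is defined as the sum, over all
$\alpha' \in A$, of the product of $P(X \xrightarrow{s} \alpha')$ and
$P(\alpha' \xrightarrow{m}^* \alpha)$:

$P(X \stackrel{w}{\Rightarrow} \alpha) = \sum_{\alpha' \in A} P(X \xrightarrow{s} \alpha') \, P(\alpha' \xrightarrow{m}^* \alpha)$, where

$A = \{\alpha' \in (V \cup NT \cup T)^+ : X \xrightarrow{s} \alpha',\ \alpha' \xrightarrow{m}^* \alpha\}$.

I.e., the probability of a ‘real’ rule is the sum of the probabilities of all
meta-derivations producing it.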

The probability of a tree is defined as the product of the probabilities of the
rules making up the tree, while the probability of a string $\alpha \in T^+$ is
defined as the sum of the probabilities assigned to all trees yielding $\alpha$.
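As a worked illustration of these definitions, consider an adjective grammar with the pseudo-rule $S \xrightarrow{s} ADJ\ Noun$ and the meta-rules $ADJ \xrightarrow{m} ADJ\ Adj$ and $ADJ \xrightarrow{m} Adj$. The probability values below are assumptions chosen for the illustration and do not come from the paper: the pseudo-rule has probability 1 and the two meta-rules have probabilities $p$ and $1 - p$, with $0 \le p < 1$. The only meta-derivation from $ADJ\ Noun$ to $Adj^k\ Noun$ applies the first meta-rule $k - 1$ times and the second once, so the sum over meta-derivations has a single term.

```latex
\[
P\bigl(S \stackrel{w}{\Rightarrow} Adj^{k}\, Noun\bigr)
  = P\bigl(S \xrightarrow{s} ADJ\ Noun\bigr)\cdot
    P\bigl(ADJ\ Noun \xrightarrow{m}{}^{*} Adj^{k}\, Noun\bigr)
  = 1 \cdot p^{\,k-1}(1-p), \qquad k \ge 1,
\]
\[
\sum_{k \ge 1} P\bigl(S \stackrel{w}{\Rightarrow} Adj^{k}\, Noun\bigr)
  = (1-p)\sum_{k \ge 1} p^{\,k-1} = 1 .
\]
```

The second line shows that, under these assumed values, the probabilities of all ‘real’ rules with left-hand side S sum to one, even though there are infinitely many such rules.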



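Theorem 2. Let $W$ be a CW-grammar, let $G = CFG(W)$, and let $W'$ be a
PCW-grammar that extends $W$ by assigning probability values to all rules in $W$.
Then $G$ can be extended to a PCFG $G'$ such that $W'$ and $G'$ assign the same
probability to every string in the language they accept.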



Proof. Let $G = (NT', T', S', R')$ be a PCFG with $NT'$, $T'$, $S'$ as defined in the
proof of Theorem 1 and $R'$ such that $X \to_p \alpha \in R'$ iff
$X \xrightarrow{m}_p \alpha$ or $X \xrightarrow{s}_p \alpha$. Note that a
$\stackrel{w}{\Rightarrow}$ derivation $\tau$ might be the product of many different
derivations using rules in $R'$ (G-derivations for short); call this set $D(\tau)$.
From the definitions it is clear that $p(\tau) = \sum_{v \in D(\tau)} p(v)$. To prove
the theorem we need to show (1) that for $\tau$ and $\tau'$ two different
$\stackrel{w}{\Rightarrow}$ derivations of the string $\alpha$, it holds that
$D(\tau) \cap D(\tau') = \emptyset$, and (2) that for every G-derivation $v$ there is
a $\stackrel{w}{\Rightarrow}$ derivation $\tau$ such that $v \in D(\tau)$. Both results
follow from Lemma 1.



For a given PCW-grammar $W$, the PCFG defined in the proof of Theorem 2
is called the PCFG underlying $W$. As an immediate consequence of the
construction of the PCFG given in Theorem 2 we get that a PCW-grammar is
consistent iff its underlying PCFG is consistent.
2.3 Learning CW-Grammars from Tree-Banks

PCW-grammars are induced from tree-banks in almost the same way as PCFGs 
are. The main difference is that the former require an explicit decision on the na- 
ture of the hidden derivations. As we will see below, the two different approaches 
to natural language parsing that we present in this paper differ substantially in 
the assumptions they make in this respect. 

2.4 Some Further Technical Notions 

Below we will use PCW-grammars to “capture” models underlying a number of
state-of-the-art parsers. The following will prove useful. Let F and G be two
grammars with tree languages T(F) and T(G) and languages L(F) and L(G),
respectively. Then, F is f-equivalent to G if L(F) = L(G) and there is a bijective
function $f : T(F) \to T(G)$. Given two grammatical formalisms A and B, we say
that A is f-transformable to B if for every grammar F in A there is a grammar
G in B such that F is f-equivalent to G.

3 Capturing State of the Art Parsers 

In this section we use PCW-grammars to capture the models underlying two 
state of the art parsers. 

3.1 Bilexical Grammars 

Bilexical grammar [4,5] is a formalism in which lexical items, such as verbs 
and their arguments, can have idiosyncratic selectional influences on each other. 
They can be used for describing bilexical approaches to dependency and phrase- 
structure grammars, and a slight modification yields link grammars. 

Background. A split unweighted bilexical grammar $B$ is a 3-tuple
$(W, \{r_w\}_{w \in W}, \{l_w\}_{w \in W})$ where:

— $W$ is a set, called the (terminal) vocabulary, which contains a distinguished
symbol ROOT.

— For each word $w \in W$, there is a pair of regular grammars $l_w$ and $r_w$,
having starting symbols $S_{l_w}$ and $S_{r_w}$, respectively. Each grammar accepts
some regular subset of $W^*$.

A dependency tree is a tree whose nodes (internal and external) are labeled
with words from $W$; the root is labeled with the symbol ROOT. The children
(‘dependents’) of a node are ordered with respect to each other and the node
itself, so that the node has both left children that precede it and right children
that follow it. A dependency tree $T$ is grammatical if for every word token $w$
that appears in the tree, $l_w$ accepts the (possibly empty) sequence of $w$'s left
children (from right to left), and $r_w$ accepts the sequence of $w$'s right
children (from left to right). Weighted bilexical grammars are like unweighted
bilexical grammars but all of their automata assign weights to the strings they
generate. Lemma 2 implies that weighted bilexical grammars are a subset of
PCW-grammars.
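The grammaticality condition can be sketched in a few lines. In this sketch Python regular expressions over space-separated words stand in for the regular grammars $l_w$ and $r_w$; this substitution, the tiny vocabulary, and all names are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    word: str
    left: list = field(default_factory=list)   # left children, surface order
    right: list = field(default_factory=list)  # right children, surface order


# Hypothetical left and right languages for each word of the vocabulary.
LEFT = {"ROOT": r"", "ball": r"((red|big)( (red|big))*)?", "red": r"", "big": r""}
RIGHT = {"ROOT": r"ball", "ball": r"", "red": r"", "big": r""}


def is_grammatical(node, left_lang, right_lang):
    """Check the condition at `node` and, recursively, at all its descendants."""
    # Left children are read from right to left, right children left to right.
    left_seq = " ".join(child.word for child in reversed(node.left))
    right_seq = " ".join(child.word for child in node.right)
    if re.fullmatch(left_lang[node.word], left_seq) is None:
        return False
    if re.fullmatch(right_lang[node.word], right_seq) is None:
        return False
    return all(is_grammatical(child, left_lang, right_lang)
               for child in node.left + node.right)


tree = Node("ROOT", right=[Node("ball", left=[Node("red"), Node("big")])])
print(is_grammatical(tree, LEFT, RIGHT))
```

The tree for "red big ball" passes because the left language of "ball" accepts the sequence "big red", which is its left children read from right to left.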
Bilexical Grammars as CW-Grammars. With every bilexical grammar $B$
we associate a CW-grammar $W_B$ as follows.

Definition 1. Let $B = (W, \{l_w\}_{w \in W}, \{r_w\}_{w \in W})$ be a split bilexical
grammar. Let $W_B = (V, NT, T, S, \xrightarrow{m}, \xrightarrow{s})$ be the CW-grammar
defined as follows:

— The set of variables $V$ is given by the set of starting symbols $S_{l_w}$ and
$S_{r_w}$ from the regular grammars $l_w$ and $r_w$ respectively, for each $w$ in $W$.

— The set of non-terminals $NT$ is some set in 1-1 correspondence with $W$, e.g.,
it can be defined as $NT = \{w' : w \in W\}$.

— The set of terminals $T$ is the set of words $W$.

— The set of meta-rules is given by the union of $\{w' \xrightarrow{m} w : w \in W\}$
and the rules in all of the right and left regular grammars in $B$.

— The set of pseudo-rules is given by $w' \xrightarrow{s} S_{\bar{l}_w}\, w\, S_{r_w}$,
where $\bar{l}_w$ denotes the regular expression inverting (reading backwards) all
strings in $L(l_w)$.

Below, we establish the (weak) equivalence between a bilexical grammar $B$
and its CW-grammar counterpart $W_B$. The idea is that the set of meta-rules,
producing derivations that would remain hidden in the tree, is used for
simulating the regular automata. Pseudo-rules are used as a nexus between a hidden
derivation and a visible one: for each word $w$ in the alphabet, we define a
pseudo-rule having $w$ as a terminal, and two variables $S_{l_w}$ and $S_{r_w}$
marking the left and right dependents, respectively. These variables correspond to
the starting symbols for the left and right automata $l_w$ and $r_w$, respectively.
Instantiating the pseudo-rule associated to $w$ would use a left and a right
derivation using the left and the right automata, respectively, via meta-rules.
The whole derivation remains hidden in the $\stackrel{w}{\Rightarrow}$ derivation,
as in bilexical grammars.

Lemma 2. Bilexical grammars are f-transformable to CW-grammars.

Proof. We have to give a function $f : T(B) \to T(W_B)$, where $B$ is a bilexical
grammar and $W_B$ the grammar defined in Definition 1, such that $f$ is
invertible. A bilexical tree yielding the string $s = w_1, \ldots, w_n$ can be
described as a sequence $u_1, \ldots, u_n$ of 3-tuples $(\alpha_i, w_i, \beta_i)$
such that $l_{w_i}$ accepts $\alpha_i$ and $r_{w_i}$ accepts $\beta_i$. The desired
function $f$ transforms a dependency tree into a W-tree by transforming the
sequence of tuples into a $\stackrel{w}{\Rightarrow}$ derivation. We define $f$ as
$f((\alpha, w_i, \beta)) = w_i' \stackrel{w}{\Rightarrow} \alpha\, w_i\, \beta$. The
rule corresponding to $(\alpha, w_i, \beta)$ is the one produced by using the
pseudo-rule $w_i' \xrightarrow{s} S_{\bar{l}_{w_i}}\, w_i\, S_{r_{w_i}}$ and
instantiating $S_{\bar{l}_{w_i}}$ and $S_{r_{w_i}}$ with $\alpha$ and $\beta$
respectively. Since the sequence of tuples forms a dependency tree, the sequence of
W-rules builds up a correct W-tree.

Expressive Power and Consistency. By Lemma 2 bilexical grammars are
weakly equivalent to context-free grammars. Moreover, the idea behind
Example 1 can be used to prove that bilexical grammars are not strongly equivalent
to CFGs. Briefly, bilexical grammars can create flat structures of the kind produced
by the grammar in Example 1; such structures cannot be produced by CFGs.

As a consequence of Lemma 2, learning bilexical grammars is equivalent to 
learning PCW-grammars, which, in turn, is equivalent to learning the PCFGs 
underlying the PCW-grammars. Eisner [4] assumed that all hidden derivations
were produced by Markov chains. Under the PCW-paradigm, his methodology is 
equivalent to transforming all trees in the training material by making all their 
hidden derivations visible, and inducing the underlying PCFG from the
transformed trees. Variables in the equivalent PCW-grammar are defined according
to the degree of the Markov chain (we assume that the reader is familiar
with Markov models and Markov chains [11]). In particular, if the Markov chain
used is of degree one, variables are in one-to-one correspondence with the set of 
words, and the consistency result follows from the fact that inducing a degree 
one Markov chain in a bilexical grammar is the same as inducing the underly- 
ing PCFG in the equivalent PCW-grammar using maximum likelihood, plus the 
fact that using maximum likelihood estimation for inducing PCFGs produces 
consistent grammars [2,8]. 

3.2 Stochastic Tree Substitution Grammars 

Data-oriented parsing (DOP) is a memory-based approach to syntactic parsing. 
The basic idea is to use the subtrees from a syntactically annotated corpus 
directly as a stochastic grammar. The DOP-1 model [1] was the first version 
of DOP, and most later versions of DOP are variations on it. The underlying 
grammatical formalism is stochastic tree substitution grammars (STSG), which 
is the grammatical formalism we capture here. 

Background. The model itself is extremely simple and can be described as
follows: for every sentence in a parsed training corpus, extract every subtree. Then
we use these trees to form a stochastic tree substitution grammar. Formally, a
stochastic tree substitution grammar (STSG) $G$ is a 5-tuple $(V_N, V_T, S, R, P)$
where

— $V_N$ is a finite set of nonterminal symbols.

— $V_T$ is a finite set of terminal symbols.

— $S \in V_N$ is the distinguished symbol.

— $R$ is a finite set of elementary trees whose top nodes and interior nodes
are labeled by nonterminal symbols and whose yield nodes are labeled by
terminal or nonterminal symbols.

— $P$ is a function which assigns to every elementary tree $t \in R$ a probability
$P(t)$. For a tree $t$ with a root node symbol $root(t) = \alpha$, $P(t)$ is
interpreted as the probability of substituting $t$ on a node $\alpha$. We require,
therefore, for a given $\alpha$ that $\sum_{\{t : root(t) = \alpha\}} P(t) = 1$
(where $t$’s root node symbol is $\alpha$).

If $t_1$ and $t_2$ are elementary trees such that the leftmost non-terminal frontier
node symbol of $t_1$ is equal to the root node symbol of $t_2$, then $t_1 \circ t_2$
is the tree that results from substituting $t_2$ in this leftmost non-terminal
frontier node symbol in $t_1$. The partial function $\circ$ is called leftmost
substitution or simply substitution. Trees are derived using leftmost substitution.
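Leftmost substitution can be sketched as follows. The `Tree` representation, the explicit set of nonterminals, and the choice to return `None` where the partial function is undefined are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Tree:
    label: str
    children: Tuple["Tree", ...] = ()  # empty tuple marks a frontier node


def substitute(t1: Tree, t2: Tree, nonterminals: set) -> Optional[Tree]:
    """Return t1 ∘ t2, or None when the leftmost non-terminal frontier node
    of t1 does not carry the root symbol of t2 (or when there is none)."""
    found = False     # leftmost non-terminal frontier node already visited
    replaced = False  # substitution actually performed

    def visit(node: Tree) -> Tree:
        nonlocal found, replaced
        if found:
            return node  # everything to the right is copied unchanged
        if not node.children:
            if node.label in nonterminals:
                found = True
                if node.label == t2.label:
                    replaced = True
                    return t2
            return node
        return Tree(node.label, tuple(visit(child) for child in node.children))

    result = visit(t1)
    return result if replaced else None


NT = {"S", "A", "B"}
t1 = Tree("S", (Tree("A"), Tree("B")))
t_a = Tree("A", (Tree("a"),))
t_b = Tree("B", (Tree("b"),))
print(substitute(t1, t_a, NT))
print(substitute(t1, t_b, NT))
```

The second call is undefined, because the leftmost non-terminal frontier node of `t1` is labeled A rather than B; this is why the operation is only a partial function.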
Fig. 2. (a) A derivation tree (b) An elementary tree



STSGs as CW-Grammars. STSGs are not quite context-free grammars. The
main difference, and the hardest to capture in a CFG-like setting, is the way in
which probabilities are computed for a given tree. The probability of a tree
is given by the sum of the probabilities of all derivations producing it.
CW-grammars offer a similar mechanism: the probability of the body of a rule is
the sum of the probabilities of all meta-derivations producing it. The idea of
the equivalence is to associate to every tree produced by a STSG a ‘real’ rule of
the PCW-grammar in such a way that the body of the rule codifies the whole tree.

To implement this idea, we need to code up trees as strings. The simplest 
way to achieve this is to visit the nodes in a depth first left to right order and 
for each inner node use the applied production, while for the leaves we type the 
symbol itself if the symbol is a terminal and a primed version of it if the symbol 
is a non-terminal. For example, the derivation describing the tree in Figure 2(a) 
is {A,BAB)B'{A,BAB)B'A'B'{B,a)a. 
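The encoding just described can be sketched as a short recursive function. Trees are represented here as `(label, children)` pairs and the set of nonterminals is passed explicitly; both choices, and the function name `encode`, are assumptions made for illustration.

```python
def encode(tree, nonterminals):
    """Serialize a tree depth-first, left to right: an inner node contributes
    its applied production '(A,children)', a terminal leaf contributes its
    symbol, and a nonterminal leaf contributes its primed symbol."""
    label, children = tree
    if not children:
        return label + "'" if label in nonterminals else label
    production = "(" + label + "," + "".join(child[0] for child in children) + ")"
    return production + "".join(encode(child, nonterminals) for child in children)


# A tree with root production A -> B A B, consistent with the string in the text.
example = ("A", [("B", []),
                 ("A", [("B", []), ("A", []), ("B", [])]),
                 ("B", [("a", [])])])
print(encode(example, {"A", "B"}))  # (A,BAB)B'(A,BAB)B'A'B'(B,a)a
```

The printed string matches the one given above for Figure 2(a): unexpanded nonterminal leaves appear primed, which marks the positions where a substitution can still take place.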

The first step in capturing STSGs is to build rules capturing elementary
trees using the notation just introduced. Specifically, let $t$ be an elementary
tree belonging to a STSG. Let $S$ be its root and $\alpha$ its string representation.
The CF-like rule $S' \to \alpha$ is called the elementary rule of $t$. Elementary
rules store all information about the elementary tree. They have primed
non-terminals where a substitution can be carried out. E.g., if $t$ is the
elementary tree pictured in Figure 2(b), its elementary rule is
$S' \to (S,AB)(A,ab)ab(B,AC)(A,ab)abC'$. Note the primed version of $C$ in the
frontier of the derivation.

Definition 2. Let $H = (V_N, V_T, S, R, P)$ be a STSG. Let
$W_H = (V, NT, T, S', \xrightarrow{m}, \xrightarrow{s})$ be the following CW-grammar.

— $V$ is the primed version of $V_N$.

— $(A, \alpha)$ is in $NT$ iff $A \to \alpha$ appears in some elementary tree.

— $T$ is exactly as $V_T$.

— $S'$ is a new symbol.

— The set of meta-rules is built by transforming each elementary tree to its
corresponding elementary rule.

— The set of pseudo-rules is given by $(A, \alpha) \xrightarrow{s} \epsilon$ if
$A \to \alpha$ appears in an elementary tree, plus rules $S' \xrightarrow{s} S$.
Two remarks are in order. First, all generative capacity is encoded in the set
of meta-rules. In the CW-world, the body of a rule (i.e., an instantiated pseudo- 
rule) encodes a derivation of the STSG. Second, the probability of a ‘real’ rule 
is the sum of the probabilities of meta-derivations yielding the rule’s body. 

Lemma 3. STSGs are f-transformable to CW-grammars.

Proof. Let $H = (V_N, V_T, S, R, P)$ be a STSG and let $W_H$ be the CW-grammar
given in Definition 2. Let $t$ be a tree produced by $H$. We prove the lemma using
induction on the length of the derivation producing $t$. If the derivation has
length 1, there is an elementary tree $t_1$ such that $S$ is the root node and
yields $\alpha$, which implies that there is a meta-rule obtained from the
elementary rule corresponding to the elementary tree $t_1$. The relation is
one-to-one as, by definition, meta-rules are in one-to-one correspondence with
elementary trees.

Suppose the lemma is true for derivation lengths less than or equal to $n$.
Suppose $t$ is generated by a derivation of length $n + 1$. Assume there are trees
$t_1$, $t_2$ with $t_1 \circ t_2 = t$. By definition there is a unique meta-rule
$r_1$ corresponding to $t_1$ and by the inductive hypothesis there is a unique
derivation for $t_2$.

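Corollary 1. Let $H = (V_N, V_T, S, R, P)$ be a STSG, and let $W_H$ be the
CW-grammar given in Definition 2. There is a one-to-one correspondence between
derivations in $H$ and $W_H$.

Lemma 4. Let $H = (V_N, V_T, S, R, P)$ be a STSG, and let $W_H$ be the
CW-grammar given in Definition 2. Both grammars assign the same probability to
trees related through this one-to-one correspondence.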



Proof. A tree has a characteristic W-rule, defined by its shape. Moreover, the
probability of a W-rule, according to the definition of PCW-grammars, is given by
the sum of the probabilities of all derivations producing the rule’s body, i.e.,
all STSG derivations producing the same tree. As a consequence, a particular
STSG tree, identified from the body of the corresponding W-rule, has the same
probability assigned by the equivalent CW-grammar.

Expressive Power and Consistency. By Corollary 1, STSGs are weakly
equivalent to context-free grammars. The consistency of an STSG depends on
the methodology used for computing the probabilities assigned to its elementary 
trees. DOP-1 is one particular approach to computing these probabilities. Under 
the DOP-1 perspective, a tree t contributes all its possible subtrees to a new 
tree-bank from which the probabilities of elementary trees are computed. Prob- 
abilities of an elementary tree are computed using maximum likelihood. Since 
the events in the new tree-bank are not independently distributed, the resulting 
probabilities are inconsistent and biased [9]. Solutions taking into account the 
dependence between trees in the resulting tree-banks have been suggested [12]. 

Even though consistency conditions cannot be derived for the DOP-1
estimation procedure given that it does not attempt to learn the underlying PCFG,
our formalism suggests that probabilities should be computed differently from
the way it is done in DOP-1. By our embedding, a tree $t$ in the tree-bank
corresponds to the body of a pseudo-rule instantiated through meta-derivations; $t$
is the final “string” and does not have any information on the derivation that took
place. But viewing $t$ as a final string changes the problem definition! Now, we
have as input a set of elementary rules and a set of accepted trees. The problem 
is to compute probabilities for these rules: an unsupervised problem that can be 
solved using any unsupervised technique. The consistency of the resulting STSG 
depends on the consistency properties of the unsupervised method. 



4 Discussion and Conclusion 

We introduced probabilistic constrained W-grammars, a grammatical framework 
capable of capturing a number of models underlying state of the art parsers. We 
established expressive power properties for two formalisms (bilexical grammars, 
and stochastic tree substitution grammars) together with some conditions under 
which the inferred grammars are consistent. We should point out that, despite 
their similarities, there is a fundamental difference between PCW-grammars and 
PCFGs, and this is the two-level mechanism of the former formalism. This
mechanism allows us to capture two state-of-the-art natural language parsers, which
cannot be done using standard PCFGs only.

We showed that, from a formal perspective, STSGs and bilexical grammars 
share certain similarities. Bilexical grammars suppose that rule bodies are ob- 
tained by collapsing hidden derivations. That is, for Eisner, a rule body is a 
regular expression. Similarly, Bod’s STSGs take this idea to the extreme by
taking the whole sentence to be the yield of a hidden derivation. PCW-grammars
naturally suggest different levels of abstraction; in [6] we have shown that these
levels can be used to reduce the size of grammars induced from tree-banks, and, 
hence, to optimize parsing procedures. 

From a theoretical point of view, the concept of f-transformable grammars,
which we use heavily in our proofs, is a very powerful one that relaxes the known
equivalence notions between grammars. Since arbitrary functions f can be
defined between arbitrary tree languages and CFG-like trees, they can be used
to map other formalisms to context-free trees. Examples include Collins’ first
model (based on Markov rules) [3], Tree Adjoining Grammars [10] or Categorial
Grammars [14]. As part of our future research, we aim to capture further
grammatical formalisms, and to characterize the nature of the functions f used
to achieve this.
Abstract. Building a generalized platform for having dialogues is a hard problem
in the field of dialogue systems. The problem becomes still more difficult if
a conversation may involve more than one domain at the same time. To help solve
this problem, we have built a component that deals with everything related to the
domains. In this paper, we describe this component, named Service Manager, and
explain what information it passes to the dialogue manager to conduct a dialogue,
and how.



1 Introduction 

Only recently have Spoken Dialogue Systems (SDS) started emerging as a practical
alternative for a conversational computer interface, mainly due to progress in the
technology of speech recognition and understanding [1]. Building a generalized
platform for having dialogues is a hard problem. Our main goal is to have a system
easily adaptable to new dialogue domains, without changing any code or structures
that lie behind the system [2].

When we turned our efforts to implement a natural language dialog system, we 
selected an architecture that could satisfy the requirements of adaptability to different 
domains. Our dialog system follows the TRIPS architecture, and consists of four main 
modules: Interpretation Manager (IM), Discourse Context (DC), Behavioural Agent 
(BA), and Generation Manager (GM)[3]. 

Faced with a multiplicity of possible dialogues, we restricted our manager to task- 
oriented dialogues in order to reduce the information acquisition needs. For each domain, 
it is necessary to define the actions available to the user. The dialog manager’s task is to de- 
termine the action intended by the user and request the information needed to perform it. 

We use frames to represent both the domain and the information collected during the
interaction with the users [4]. Each domain handled by the dialogue system is internally
represented by a frame, which is composed of slots and rules. Slots define domain data
relationships, and rules define the system behaviour. Rules are composed of operators
(logical, conditional, and relational) and functions.

To keep the filling of the frame slots consistent, it is necessary to indicate the set of
values with which a slot can be instantiated. To avoid invalid combinations of values, we
have defined a meta-language to express the constraints that must be satisfied. So, each
frame definition includes a set of recognition rules, used to identify the objects that come
with a user utterance and to specify the set of values that each slot may hold; a set of
validation rules, to express a set of domain restrictions, i.e., invalid combinations of slot
values; and classification rules, used to specify the actions that must be performed when
some conditions are satisfied, i.e., the valid combinations of values. This approach has
the advantage of making the dialogue control independent of the domain.

Fig. 1. Architecture of the Dialogue Manager

We have designed a Service Manager (SM) as the interface between the spoken
dialogue platform and a set of heterogeneous devices. The role of the SM is to provide
the dialogue manager with everything that is related to a domain, which includes
representation, access restrictions and, most important of all, dialogue independence. More
than one device can be part of a single domain. The SM module determines what a
user can really do or execute in a task-oriented dialogue system, that is, the services. Inside
a domain, a set of services is grouped in one or more states. The services returned by
the SM to the dialogue manager are restricted to the current state, which prevents users
from doing things that they are not supposed to do, like turning on the lights when the
lights are already turned on.
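The state-based restriction of services can be illustrated with a minimal sketch. The class `StatefulDomain`, its fields, and the service names are hypothetical and are not taken from the system described here; the sketch only assumes that each state lists its services and the state each service leads to.

```python
from dataclasses import dataclass


@dataclass
class StatefulDomain:
    name: str
    state: str
    services_by_state: dict  # state -> {service name: resulting state}

    def available_services(self):
        """Only the services grouped under the current state are offered."""
        return sorted(self.services_by_state[self.state])

    def execute(self, service):
        transitions = self.services_by_state[self.state]
        if service not in transitions:
            raise ValueError(
                f"service {service!r} is not available in state {self.state!r}")
        self.state = transitions[service]


lights = StatefulDomain(
    name="lights",
    state="off",
    services_by_state={"off": {"turn_on": "on"}, "on": {"turn_off": "off"}},
)
print(lights.available_services())
lights.execute("turn_on")
print(lights.available_services())
```

After `turn_on` has been executed, the only service offered is `turn_off`, so a request to turn on lights that are already on is never presented to the dialogue manager as a valid action.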

Our spoken dialog platform has been used to produce a spoken dialog system called 
“house of the future”, and it is also being used on a demonstration system, accessible 
by telephone, where people can ask for weather conditions, stock hold information, bus 
trip information and cinema schedules. 

This paper is divided into six sections. The next section describes the Dialogue
Manager and its four main components. Section three describes the Service Manager
and its main role in achieving dialogue domain independence. In section four we present
a dialogue example, showing the flows between the Dialogue Manager and the Service
Manager. Next, we describe our practical experience in implementing our
dialogue system to control parts of the “house of the future”. Finally, we present
some concluding remarks and future work.
2 Dialogue Manager

Our Dialogue Manager is composed of four main modules: (i) Interpretation Manager;
(ii) Discourse Context; (iii) Behavioral Agent; and (iv) Generation Manager.

The Interpretation Manager (IM) receives a set of speech acts and generates the
corresponding interpretations and discourse obligations [5][6][7]. Interpretations are frame
instantiations that represent possible combinations of speech acts and the meaning
associated with each object they contain [1]. To select the most promising interpretation, two
scores are computed: the recognition score, which evaluates the rule requirements already
accomplished, and the answer score, a measure of the consistency of the data already
provided by the user. A more detailed description of this process can be found in [8].

The Discourse Context (DC) manages all knowledge about the discourse, including
the discourse stack, turn-taking information, and discourse obligations.

The Behavioral Agent (BA) enables the system to be mixed-initiative: regardless of
what the user says, the BA has its own priorities and intentions. When a new speech 
act includes objects belonging to a domain that is not being considered, the BA as- 
sumes the user wants to introduce a new dialog topic: the old topic is put on hold, 
and priority is given to the new topic. Whenever the system recognizes that the user 
is changing domains, it first verifies if some previous conversation has already taken 
place. 

The Generation Manager (GM) is a very important component in a dialog manager. To
communicate with the user, it has to transform the system intentions into natural language
utterances. It receives discourse obligations from the Behavioral Agent, and transforms
them into text, using template files. Each domain has a unique template file. The
Generation Manager uses another template file to produce questions that are not domain
specific. For example, domain disambiguation questions, used to decide how to proceed
with a dialogue involving two or more distinct domains, and clarification questions are
defined in this file.



3 Service Manager 

We designed the Service Manager (SM) as the interface between the spoken dialogue
platform and a set of heterogeneous domains [2]. A domain is a representation of a set of
devices that share the same description. This description is composed of slots and rules.
Slots define domain data relationships, and rules define the system behaviour. A rule
or service represents a possible user action. The spoken dialogue platform has been
applied to several different types of devices. They are split into two large sets: (i) a
home environment; and (ii) database retrieval; see Figure 2. In the first, based
on the X10 and IrDA protocols, we control a wide range of home devices, such as lights,
air conditioning, hi-fi, and TV. We can extend the application to include any device that is
infra-red controllable or whose control functions may be programmed through the X10
protocol. The second type of application, information retrieval via voice from remote
databases, has been tested with weather information, cinema schedules and bus trip
information. This type of application can easily be extended to other domains.
Fig. 2. Interaction with the Service Manager



3.1 Interface Operations 

The Service Manager component of the Spoken Dialogue System provides the following
main set of operations:

- Domain Description Insertion: this operation inserts a new domain in the Service
Manager. It receives as argument a path to a file that contains an XML description
of a domain. After the completion of this operation, a Service Manager internal
representation of the domain is maintained for future requests.

- Objects Recognition: for the Interpretation Manager to build interpretations and
discourse obligations, identifying every object in a speech utterance is crucial.
Whenever this identification is needed, this operation is invoked with the following
arguments: (i) a list with the names of all objects to be recognized; and (ii) the
current user. This operation returns a list with the descriptions of the identified
objects. Objects belonging to domains to which the current user does not have access
are not checked. If an object is not recognized, the name of the object is returned
without a description.

- Domain Services Return: a service represents a possible user action. This operation
returns the set of services associated with the specified domain(s). In a task-oriented
dialogue system, these services are used to determine the system behavior: an answer
or the execution of a procedure. The operation receives as arguments: (i) the domains;
and (ii) the current user. It is possible to ask for all domain services, or for services
from specific domains. The current user is used to check whether the user has access to
the services.

- Execution of Services: this operation allows for the execution of services that belong
to a domain in the Service Manager. More than one service can be requested for
execution at the same time.
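The four operations can be pictured with the following sketch of a facade-style interface. Everything in it is hypothetical: the class and method names are invented for illustration, domains are passed as in-memory objects rather than as a path to an XML file, and access control is reduced to a set of allowed users per domain.

```python
from dataclasses import dataclass


@dataclass
class Domain:
    name: str
    objects: dict       # object name -> description
    services: dict      # service name -> callable performing the action
    allowed_users: set  # users with access to this domain


class ServiceManager:
    """Single entry point for everything the dialogue manager asks about domains."""

    def __init__(self):
        self._domains = {}

    def insert_domain(self, domain):
        # Domain Description Insertion (an XML file path in the real system).
        self._domains[domain.name] = domain

    def recognize_objects(self, names, user):
        # Objects Recognition: domains the user cannot access are not checked.
        visible = [d for d in self._domains.values() if user in d.allowed_users]
        return [(name, next((d.objects[name] for d in visible
                             if name in d.objects), None))
                for name in names]

    def domain_services(self, user, domains=None):
        # Domain Services Return: all domains, or only the requested ones.
        names = list(self._domains) if domains is None else domains
        return {n: sorted(self._domains[n].services) for n in names
                if n in self._domains and user in self._domains[n].allowed_users}

    def execute(self, requests, user):
        # Execution of Services: several (domain, service) pairs per call.
        results = []
        for domain_name, service_name in requests:
            domain = self._domains[domain_name]
            if user not in domain.allowed_users:
                raise PermissionError(f"{user} cannot access {domain_name}")
            results.append(domain.services[service_name]())
        return results


sm = ServiceManager()
sm.insert_domain(Domain("lights", {"lamp": "a light device"},
                        {"turn_on": lambda: "lights on"}, {"alice"}))
print(sm.recognize_objects(["lamp", "tv"], "alice"))
print(sm.domain_services("alice"))
print(sm.execute([("lights", "turn_on")], "alice"))
```

An unrecognized object such as "tv" comes back with its name and no description, mirroring the behaviour described for the Objects Recognition operation.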

The use of the Service Manager with these operations provides the following features:

- A dialogue-independent component that offers all the information related to
domains. The Service Manager becomes the only “door” through which the SDS components
get information about a domain.

- There is no mixture between the linguistic parts of the SDS and information about
the domains.

- Because information about the domains is concentrated, it is easier to define
layers of reasoning using that information and to extract knowledge from it.

- As we are working with different types of devices, there was a need to define
an abstract representation to create a uniform characterization of domains. The
definition of this abstract description allows for an equal treatment of the domains.

- A new domain is more easily introduced.



3.2 Architecture 

Figure 3 shows the service manager architecture. It is composed of five modules: (i)
Service Manager Galaxy Server; (ii) Device Manager; (iii) Access Manager; (iv) Object
Recognition Manager; and (v) Device Proxy.




[Diagram with five blocks: Service Manager Galaxy Server, Device Manager, Access
Manager, Object Recognition Manager, Device Proxy]



Fig. 3. Service Manager Architecture 



The Service Manager Galaxy Server defines the interface of the service manager. It is
therefore the single point of communication through which all requests and responses
pass. All the operations of the interface explained above are defined here. This
module follows the “facade” design pattern, which provides a simplified, standardized
interface to a large set of objects. This avoids having to reference many different,
complicated interfaces, one for each object of that set.
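
As a minimal sketch of the facade idea, the following Python fragment hides three small helper classes behind one entry point. The class and method names are hypothetical and chosen only for illustration; the paper does not give the actual interfaces of the modules.

```python
class AccessManager:
    def __init__(self, permissions):
        self.permissions = permissions   # device -> set of users; missing = open

    def allowed(self, user, device):
        users = self.permissions.get(device)
        return users is None or user in users


class ObjectRecognitionManager:
    def __init__(self, objects):
        self.objects = objects           # object name -> kind of object

    def describe(self, name):
        return self.objects.get(name)


class DeviceManager:
    def __init__(self, services):
        self.services = services         # device -> list of service names

    def services_of(self, device):
        return self.services.get(device, [])


class ServiceManagerFacade:
    """Single entry point; callers never touch the modules behind it."""

    def __init__(self, access, recognition, devices):
        self._access = access
        self._recognition = recognition
        self._devices = devices

    def services_for(self, user, device):
        # The access filter is applied before any service is returned.
        if not self._access.allowed(user, device):
            return []
        return self._devices.services_of(device)

    def describe(self, name):
        return self._recognition.describe(name)


facade = ServiceManagerFacade(
    AccessManager({"tv": {"marcio"}}),
    ObjectRecognitionManager({"turn-on": "action"}),
    DeviceManager({"tv": ["turn-on-tv"], "lights": ["turn-on-light"]}),
)
```

A caller such as the dialogue manager would hold only the `facade` object, so the modules behind it can change without affecting the caller.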

The Device Manager is the main module of the service manager. It is the business layer
of the architecture and contains the main methods by which the service manager works,
including the ones that deal with service requests.

The main function of the Access Manager component is to serve as a filter. This filter
is applied to all service requests and restricts the available services to those the
current user may use. To do so, this module manages a device restrictions file and
checks whether or not the user has permission to access specific devices.

The Object Recognition Manager component is responsible for the identification of 
all objects. This component manages the XML description object files for each domain 
available in the system. Figure 4 shows part of a lights domain recognition objects file. 
Two actions are available, turn-on and turn-off. This means that these two objects are 
identified as being actions in the lights domain. 

<objects>

<action> 

<item> turn-on </item> 
<item> turn-off </item> 
</action> 

</objects> 



Fig. 4. Part of a recognition objects file 
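
A recognition objects file of this shape could be loaded into a lookup table as in the following sketch. It uses Python's standard `xml.etree.ElementTree` module; the function name `load_objects` and the choice of a dictionary are assumptions for illustration, not the system's actual code.

```python
import xml.etree.ElementTree as ET

# Same content as the fragment in Fig. 4.
LIGHTS_OBJECTS = """
<objects>
  <action>
    <item> turn-on </item>
    <item> turn-off </item>
  </action>
</objects>
"""


def load_objects(xml_text):
    """Map each object name to its kind (the tag enclosing its <item>)."""
    root = ET.fromstring(xml_text)
    table = {}
    for kind in root:                       # e.g. the <action> element
        for item in kind.findall("item"):
            table[item.text.strip()] = kind.tag
    return table


lights = load_objects(LIGHTS_OBJECTS)
```

Looking up `lights.get("turn-on")` then yields the kind `"action"`, while an unknown word yields `None`, which corresponds to an object returned without a description.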



The Device Proxy module keeps the connection to the real device (the one that
implements the services). For some devices, it is impossible to obtain the state
directly from the device. For example, it is difficult to check whether a television
is on or off, since most televisions do not have sensors for this purpose. For this
reason, this module is also capable of creating and managing a device state, which
changes only when a service is executed. The system uses this capability only when no
state can be obtained from the device; otherwise, it queries the device directly for
its current state.
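
The fallback behavior of the Device Proxy can be sketched as follows. This is only an illustration under assumptions: the class layout, the `read_state` callback, and the initial state `"off"` are invented for the example.

```python
class DeviceProxy:
    """Keeps a shadow state for devices that cannot report their own."""

    def __init__(self, read_state=None, initial="off"):
        self._read_state = read_state   # callable querying the real device, or None
        self._shadow = initial          # used only when the device is silent

    def state(self):
        if self._read_state is not None:
            return self._read_state()   # ask the device directly
        return self._shadow

    def execute(self, service, new_state):
        # A real proxy would send the command for `service` to the device here.
        self._shadow = new_state        # the shadow state changes only on execution
        return new_state


tv = DeviceProxy()                        # no sensor: the state is tracked by the proxy
lamp = DeviceProxy(read_state=lambda: "on")   # the device reports its own state
```

After `tv.execute("turn-on-tv", "on")`, `tv.state()` returns the tracked value, whereas `lamp.state()` always reflects what the device itself reports.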

The representation of the services is based on an XML description, validated by a
generic DTD. DOM trees, resulting from the XML processing, are used as the internal
representation of each device. A device may have more than one state, and a state can
include more than one service. At any given instant, only the hierarchy of services
available in the current state of the device can be accessed.



<service name="turn-on-light">

<label>Turns on the bedroom light</label>
<params>

<param action value="turn-on"></param>
<or>

<param action value="turn-on"></param>
<param zone></param>

</or>

</params>

<execute>

<name>turn-on</name>

<success>on</success>

<failure/>

</execute>

</service>



Fig. 5. Description of a service 



Figure 5 contains the description of a service. It is only one of a large set of services
of a domotic domain we have implemented. It describes a service that is used to turn on
the lights of two different zones of a house. The preconditions for the execution of
the service are stated explicitly inside the XML tag “params”. By default, the logical
operator “and” is used. In this case, there are two possible ways of executing the
service: (i) provide only an action with the value “turn-on”, which means that a user
only needs to say “turn-on” to the system to turn the lights on; or (ii) provide an
action with the value “turn-on” together with the zone where the user wants to act.

The tag named “execute” defines the behavior for the execution of the service. The tag
“name” indicates the code name of the function that is executed when the service is
selected. The tag “success” indicates the new state of the device. This new state, “on”,
naturally does not include the “turn-on-light” service, but it does include a
“turn-off-light” service: once a light has been turned on, it cannot be turned on
again, but it can be turned off.
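
The way such preconditions might be checked can be sketched in Python. Two assumptions are made loudly here: the service description is rewritten with a hypothetical `name` attribute on each “param” so that it is well-formed XML, and each “or” element is read as one alternative group of parameters that are combined with “and”. Neither detail is confirmed by the paper.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the service description; the "name" attribute is assumed.
SERVICE = """
<service name="turn-on-light">
  <params>
    <param name="action" value="turn-on"/>
    <or>
      <param name="action" value="turn-on"/>
      <param name="zone"/>
    </or>
  </params>
  <execute>
    <name>turn-on</name>
    <success>on</success>
    <failure/>
  </execute>
</service>
"""


def alternatives(service):
    """List the parameter groups, each as [(slot, required value or None), ...]."""
    params = service.find("params")
    groups = [params.findall("param")]
    groups += [g.findall("param") for g in params.findall("or")]
    return [[(p.get("name"), p.get("value")) for p in g] for g in groups]


def matching(service, slots):
    """Return the groups whose parameters are all satisfied by the given slots."""
    def satisfied(group):
        return all(name in slots and (value is None or slots[name] == value)
                   for name, value in group)
    return [g for g in alternatives(service) if satisfied(g)]


def execute(service, slots, state):
    """Return the device state after the request (unchanged if nothing matches)."""
    if not matching(service, slots):
        return state
    return service.findtext("execute/success")


service = ET.fromstring(SERVICE)
```

With these assumptions, saying only “turn-on” satisfies the first group, and adding a zone satisfies both groups; in either case the device moves to the state named in “success”.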

The SM can also restrict the access of certain users to the services. There are three
types of devices: (i) public, (ii) private, and (iii) restricted. Public devices are
available to all users; whenever the SM looks for services, these are immediately
considered. Private devices are available only to users who have gained privileged
access by entering a key or password in the system. Restricted devices are even more
specific than private ones: they name particular users, and no one else can access
them. Figure 6 shows part of a file defining the access information for three devices.
The first device, named “Future House Bedroom Light Controller”, is declared public, so
everyone has access to it. Below it is a private device named “Future House Bedroom TV
Controller”, meaning that users who want to operate the TV controller must have
privileged access. At the bottom is a restricted device; in this case, only the user
named “marcio” can access it.



<device hashcode="589654695" type="public">
<name>Future House Bedroom Light Controller</name>
<first_access>1083357587921</first_access>
<last_access>1084211071109</last_access>

<users/>

</device>

<device hashcode="614235759" type="private">
<name>Future House Bedroom TV Controller</name>
<first_access>1084204657484</first_access>
<last_access>1084204716140</last_access>

<users/>

</device>

<device hashcode="614235761" type="restricted">
<name>Future House Bedroom Glass Controller</name>
<first_access>1084206473546</first_access>
<last_access>1084211071109</last_access>

<users>

<user>marcio</user>

</users>

</device>



Fig. 6. Device restriction file content example 
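
The three access types can be turned into a filter along the following lines. This Python sketch is an illustration only: the enclosing `devices` root element, the `privileged` flag standing for a user who has entered a key or password, and the function names are assumptions, and the timestamps are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Reduced copy of the restriction file; the <devices> root element is assumed.
RESTRICTIONS = """
<devices>
  <device hashcode="589654695" type="public">
    <name>Future House Bedroom Light Controller</name>
    <users/>
  </device>
  <device hashcode="614235759" type="private">
    <name>Future House Bedroom TV Controller</name>
    <users/>
  </device>
  <device hashcode="614235761" type="restricted">
    <name>Future House Bedroom Glass Controller</name>
    <users><user>marcio</user></users>
  </device>
</devices>
"""


def can_access(device, user, privileged=False):
    """Apply the public / private / restricted rule to one <device> element."""
    kind = device.get("type")
    if kind == "public":
        return True
    if kind == "private":
        return privileged                     # key or password already given
    if kind == "restricted":
        return user in [u.text for u in device.findall("users/user")]
    return False                              # unknown types are refused


def accessible_devices(xml_text, user, privileged=False):
    root = ET.fromstring(xml_text)
    return [d.findtext("name") for d in root.findall("device")
            if can_access(d, user, privileged)]
```

For example, an ordinary user sees only the light controller, a privileged user also sees the TV controller, and only “marcio” sees the glass controller.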


3.3 Addition of a New Domain

To add a new dialogue domain, it is necessary to: (i) define a description of the
domain in XML, which includes the new services (see Figure 5); (ii) define the
implementation code that will actually execute the services, which is a kind of device
driver; (iii) define an XML file with the identification of the domain objects, which
will be used for object recognition (see Figure 4); and (iv) define an XML template
file with the system utterances for the domain. The Generation Manager module makes use
of this last file, but currently loads it itself rather than requesting it from the
Service Manager. Figure 7 shows some utterances used by the dialogue system in the
lights domain. The tag “requestslot” is used to ask for a specific slot value. When an
action must be filled in, one of the three utterances inside the tag “action” is chosen
at random. If the system wants to ask for a zone to act on, the utterance nested in the
tag “zone” is sent to the user, with “[$action]” replaced by the action previously
provided by the user.



<requestslot> 

<action> 

<utt> Do you wish to turn-on or turn-off the light? </utt> 
<utt> What are your intentions? You want the lights on or off? 
</utt> 

<utt> Do you wish to raise or diminish the light intensity? 
</utt> 

</action> 

<zone> 

<utt> What light do you want to [$action]? </utt> 

</zone> 

</requestslot> 



Fig. 7. Part of a template system utterances file 
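
The selection and filling of a template utterance can be sketched as below. The function name `request_slot`, its parameters, and the use of Python's `random` module are assumptions for illustration; the paper only states that one utterance is chosen at random and that “[$action]” is replaced.

```python
import random
import xml.etree.ElementTree as ET

# Same utterances as in Fig. 7.
TEMPLATES = """
<requestslot>
  <action>
    <utt> Do you wish to turn-on or turn-off the light? </utt>
    <utt> What are your intentions? You want the lights on or off? </utt>
    <utt> Do you wish to raise or diminish the light intensity? </utt>
  </action>
  <zone>
    <utt> What light do you want to [$action]? </utt>
  </zone>
</requestslot>
"""


def request_slot(xml_text, slot, known=None, rng=random):
    """Pick one utterance asking for `slot` and fill its [$name] placeholders."""
    root = ET.fromstring(xml_text)
    candidates = [u.text.strip() for u in root.findall(f"{slot}/utt")]
    utterance = rng.choice(candidates)        # random pick among the alternatives
    for name, value in (known or {}).items():
        utterance = utterance.replace(f"[${name}]", value)
    return utterance
```

Calling `request_slot(TEMPLATES, "zone", {"action": "turn-on"})` fills the placeholder with the action the user has already given.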



4 Dialogue Flow Example 

In this section we present a small example of how our dialogue system works, including
some of the messages exchanged between the components. The example assumes that the
dialogue system controls the lights and the television of one room of a house. The
system can turn the lights and the television on and off. The task is made harder by
the existence of two independent light zones to which the user can apply the light
operations. Operations can be applied to one zone independently of the other: for
example, turn on the lights of zone A and afterwards turn off the lights of zone B.

1. The user starts by saying “turn-on”. The Interpretation Manager (IM) component
receives the speech act associated with this phrase. To build the interpretations, the
IM needs to know what it means in the available domains.

2. The IM asks the Service Manager (SM) for information about the “turn-on” object.
The SM determines which domains are available and accessible to the user and searches
a database for the object using that information.

3. The SM responds to the IM. The response gives the characteristics of the object
to the IM. In this case, the object is recognized in two different domains: the word
is an action that belongs to both the lights domain and the television domain.

We should note that, at any time, what is or is not recognized depends entirely on
the SM. So if, for some reason, we decide to remove the lights device from the
domain, making it inaccessible, the next time recognition is performed the answer
would still be positive, but only for the television domain.

4. The IM creates two interpretations. Because the object was recognized in two
different domains, the IM creates a Discourse Obligation (DO) accordingly. In this
case, the IM creates a Domain Disambiguation DO.

5. The IM sends the DO created to the Behavioral Agent (BA). 

6. The BA selects this DO as the next system action, and sends it to the Generation 
Manager (GM). 

7. The GM uses the domain-independent template file to generate a disambiguation
question. In this case the system generates a question like: “What do you
pretend to turn-on, the lights or the tv?”. We should note that the system uses a
single domain-independent template file to generate all disambiguation questions.
In this question, “turn-on” is the intended action, and “lights” and “tv” are the
domain names associated with the Service Manager.

Let us suppose the user says that he wants to turn on the lights; in other words, the
user's answer was “the light”. With this answer the disambiguation is over, and the
IM will deal with only one interpretation.

8. The IM asks the SM for the services associated with the light domain. The IM now
needs to know how to proceed with the dialogue. To do that, it asks the SM for
information about all the services associated with that particular domain.

9. The SM processes the IM request. The SM retrieves information about all services
associated with the particular domain. As said before, these services define the
behavior of the system and constitute what can be done by the Dialogue Manager. We are
assuming that the user has the necessary privileges to access these services.

10. The SM returns a response to IM. 

11. The IM processes the SM answer. The SM has found at least two services for the
light domain: (i) the turn-on service; and (ii) the turn-off service. But because the
user has already said that he wants to turn on the lights (the Discourse Context
provides this information), only this service is considered.

12. IM creates a DO for execution of the service. 
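
The decision taken by the IM in this walk-through can be condensed into a small Python sketch. It is a simplification under assumptions: the function name, the tuple format, and the obligation labels are invented for illustration, and the intermediate request for the domain services (steps 8 to 11) is left out.

```python
def next_obligation(object_name, recognized_in, chosen_domain=None):
    """Decide which discourse obligation (DO) would be created.

    recognized_in: domains in which the SM recognized the object.
    chosen_domain: domain already fixed by the discourse context, if any.
    """
    if chosen_domain in recognized_in:
        domains = [chosen_domain]           # the user has already disambiguated
    else:
        domains = list(recognized_in)
    if not domains:
        return ("not-recognized", object_name, [])
    if len(domains) > 1:
        return ("domain-disambiguation", object_name, domains)
    return ("execute-service", object_name, domains)
```

In the example dialogue, the first call has two candidate domains and yields a disambiguation obligation; after the answer “the light”, the same call with the chosen domain yields an execution obligation.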

5 House of the Future 

We recently extended the demo room at our lab to the project known as “Casa do
Futuro”. This project consists of a house in which some of the latest smart-house
technology innovations are installed. We have installed our system in the main
bedroom. This room contains: (i) one big tv with split-screen capabilities; (ii) two
independent light zones; and (iii) a glass that can switch between an opaque and a
transparent state. It was not possible to obtain the current state of the television
or of the glass, so no feedback coming directly from these devices could be used. For
that reason, unlike the lights, whose intensity we could read, the current state of
the television and of the glass was maintained in the Service Manager and updated
whenever an operation was executed on the device.

Besides integrating the recognition and synthesis technologies with the spoken
dialogue system, we also developed an animated character with lip-sync capability [9].
So, whenever a visitor wants to control a device, he calls our virtual butler, who
appears on the tv set, waits for orders, and gives feedback when something happens.



6 Concluding Remarks 

The work reported in this paper results from an integration of several components being 
developed in our lab. This system is the result of a collaborative effort between people 
working in the different technologies, and it became a common platform where the 
different components can be associated to produce multiple applications. 

The Service Manager (SM) was developed to provide a single point of contact
with the dialogue domain(s). The SM can be modified without the dialogue manager
being aware of it; the modifications become visible only when a request is made. We
achieved independence of the dialogue manager from the domains, and we can introduce
new domains without having to modify the dialogue manager behavior. Currently, the
template file with the system utterances used by the dialogue system is loaded in the
Generation Manager module. Because this is an activity that involves domain knowledge,
the SM should also provide the Dialogue Manager with information about generation
activities. This remains future work in the SM development.

The system has been used in the “house of the future” to control three devices. This
work also made use of speech recognition and an animated character. The system is also
being used in a demonstration system, accessible by telephone, where people can ask
for weather conditions, stock information, bus trip information and cinema schedules.

We expect to conduct a more formal evaluation of our system in order to quantify 
the user satisfaction. 


Abstract. Search performance can be greatly improved by describing data
using Natural Language Processing (NLP) tools to create new metadata and
domain ontologies. A methodology is presented that uses domain-specific
knowledge to improve user requests. This knowledge is based on concepts,
extracted from the document itself, that are used as “semantic metadata tags” to
annotate XML documents. We present the process followed to define new XML
semantic metadata and add them to a digital library of scientific theses.
Using these new metadata, an ontology is also constructed by following a
methodology. Effective information retrieval is obtained by using an intelligent
system based on XML semantic metadata and a domain ontology.



1 Introduction 

The Internet has fostered digital libraries that make a great amount of digital
information available. Search engines work to provide this information to the user.
Although there have been substantial advances in structuring information, users must
still evaluate the pertinence of the documents presented by the web. Generally, to
evaluate pertinence, users read several fragments of the documents rather than the
complete documents. Reading and then evaluating several entire documents is tedious,
which is why many pertinent documents remain unknown to users. Our objective is to
propose a solution that enables better access to pertinent documents or fragments of
documents in digital libraries.

The INSA of Lyon project called CITHER (consultation of entire text versions
of theses) concerns the online publishing of scientific theses. We encountered the
same difficulties in finding pertinent information in the CITHER system as in other
digital libraries. During a search session, it is impossible to extract the pertinent
contents of several theses. To evaluate the pertinence of a thesis, users must read the
entire document. Furthermore, a document may be too long for a quick evaluation.

A promising way to solve this problem is to use metadata to “annotate” the
documents and to describe their content in a better way. In our proposal, we have
decided to extract the concepts that best describe the theses and to use them as
metadata, like “semantic tags”. Of course, manual extraction of concepts is
time-consuming, so we use Natural Language Processing (NLP) tools to automate the
extraction of concepts and overcome this limitation.

Another promising way is to use an ontology based on the concepts used as “semantic
tags”. An ontology is the description of concepts and their relationships. In our
approach, the construction of an ontology from digital theses is proposed by following
a methodology.

In our context, which is a digital library that publishes scientific theses, the 
addition of new semantic information into documents is intended to improve 
information retrieval. In order to insert new semantic information into digital theses, 
we have used a tool able to extract concepts from a given document. Section 3.1 
describes how we have chosen this tool, and then we present the annotation system. 

One of the causes of non-pertinent retrieval is the difficulty of describing the
users' needs with the appropriate words. As a result, queries are often too broad or
too specific to cover the relevant information. Our approach is therefore based on the
search of annotated theses, using an ontology to expand user requests and to make it
possible to select among pertinent documents. The ontology is composed of the
terms of a domain, which become, in our proposal, the “semantic tags” used to
annotate theses (Section 3). In addition, the ontology includes the relationships
between concepts. The identification of relationships among concepts and the
methodology followed to construct our ontology are described in Section 4. This
methodology is based on two steps: (1) the ontology capture step (Section 4.1) and
(2) the ontology coding step (Section 4.2). In Section 5, we present the intelligent
system designed to access pertinent information by using our ontology. The conclusion
and further research are presented at the end.



2 Background 

Terms are linguistic representations of concepts in a particular subject field [6].
Consequently, applications of the automatic extraction of concepts, called terms in
many cases, include the construction of specialized dictionaries, human and machine
translation, and indexing in books and digital libraries [6].

The ontology of the University of Michigan Digital Library (UMDL) [23] delineates
the process of publication using six formal concepts: “conception”, “expression”,
“manifestation”, “materialization”, “digitization” and “instance”. Each of these
concepts is related to other concepts by “has”, “of”, “kind-of” or “extends”
relationships.

An ontology in the domain of digital libraries is presented in [19]. ScholOnto is an
ontology-based digital library server to support scholarly interpretation and
discourse. It enables researchers to describe and debate, via a semantic network, the
contributions a document makes and its relationship to the literature. As a result, by
using this ontology, researchers will no longer need to make
claims about the contributions of documents (e.g. “this is a new theory”, “this is a
new model”, “this is a new notation”, etc.), or contest their relationships to other
documents and ideas (e.g. “it applies”, “it extends”, “it predicts”, “it refutes”, etc.).

Some of the methods used to specify ontologies in digital library projects include 
vocabularies and cataloguing codes such as Machine Readable Cataloguing (MARC). 
Other projects are based on the use of thesauri and classifications to describe different 
components of a document such as the title, the name of the author, etc. Thus, semantic 
relationships among concepts are defined by using broader and related terms 
in thesauri [4] . 


3 Methodology Used to Annotate Theses

In large document collections, such as digital libraries, it is very important to use
mechanisms able to select the pertinent information. The use of “keywords” to
represent documents is a promising way to manipulate information and to classify
pertinent or non-pertinent documents contained in digital libraries.

Annotation is the process of adding semantic markups to documents, but 
determining which concepts are tied to a document is not an easy task. To address this 
problem, several methods are proposed to extract concepts from a given document. In 
the field of NLP tools for the extraction of concepts we consider two main 
approaches: “key phrase assignment” and “key phrase extraction” [24]. 

By the term “key phrase”, we mean a phrase composed of two or more words that
describes, in a general way, the content of the document. “Key phrases” can be seen
as “key concepts” able to classify documents into categories.

“Key phrase assignment” uses a controlled vocabulary to select the concepts or
phrases that describe, in a general way, the document. In contrast, “key phrase
extraction” selects the concepts from the document itself.

Our approach takes a document as input and automatically generates a list of
concepts as output. This work could be called “key phrase generation” or “concept
generation”. However, the NLP tool used in our work performs “concept extraction”,
which means that the extracted concepts always appear in the body of the input
document.

3.1 A Natural Language Processing Tool to Extract Concepts 

In order to choose the tool able to extract the highest number of pertinent concepts,
we have analyzed four tools: (1) TerminologyExtractor of Chamblon Systems Inc., (2)
Xerox Terminology Suite of Xerox, (3) Nomino of Nomino Technologies and (4)
Copernic Summarizer of NRC.

To evaluate the output list generated by each tool, we compared it with a list of
concepts produced manually. The performance measure and the method followed for
scoring concepts, as well as the results showing that Nomino is the most suitable
tool for our approach, are described in [1].
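
The actual performance measure is the one given in [1] and is not reproduced here. Purely as an illustration of comparing a tool's output with a manual list, the following Python sketch computes the overlap between the two lists together with precision and recall, which are assumed stand-ins and not necessarily the measure the authors used.

```python
def compare(extracted, manual):
    """Compare a tool's concept list with a manually built reference list."""
    extracted = {c.lower().strip() for c in extracted}
    manual = {c.lower().strip() for c in manual}
    common = extracted & manual
    # Precision: share of extracted concepts that are in the manual list.
    precision = len(common) / len(extracted) if extracted else 0.0
    # Recall: share of manual concepts that the tool found.
    recall = len(common) / len(manual) if manual else 0.0
    return sorted(common), precision, recall
```

For instance, a tool that returns “Information System” and “next chapter” against a manual list of “information system” and “logical model” shares one concept with the reference, giving a precision and a recall of one half each.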

Therefore, in our work we use Nomino to automatically extract concepts from
documents. Nomino is a search engine distributed by Nomino Technologies [16]. It
adopts a morphosyntactic approach and uses a morphological analyzer that performs
“stemming”, meaning that prefixes and suffixes are removed to obtain a simple word.
Nomino applies empirical criteria to filter the noise associated with the extracted
concepts. These criteria include frequency and category, as well as stop lists.

Nomino produces two types of interactive index containing the concepts that most
accurately summarize the content of a given document. One of the indexes is very
general. The other contains concepts selected according to two principles: the “gain
to express” and the “gain to reach”. The “gain to express” classifies the concepts
according to their location in the given document. For example, if a paragraph is
concerned with only one concept, this concept will be classified as important. The
“gain to reach” classifies the concepts according to their frequency of appearance: if
a word is very rare, it will be selected as important. For example, suppose a given
document contains both the phrase “computer software” and the phrase “developing
computer software”. The second phrase will be selected as the more important one
because it describes an activity more completely. On the other hand, if the
frequency of “computer software” is higher than that of “developing computer
software”, both phrases will appear in the list of concepts created by Nomino.

3.2 The Annotation Tool 

Since manual annotation can be time-consuming and error-prone, we have
developed a tool that makes it easy to add metadata to documents by selecting them
from a base of concepts.

To use the concepts extracted into the Nomino index, we have proposed a
tool to “annotate” documents [1]. The task consists in adding new metadata to the
thesis while the PhD student is writing it. We do not consider the writing process using
LaTeX, since many of the theses found in the CITHER project were not written using
LaTeX. The student adds the new metadata based on (1) the base of concepts, (2) the
Nomino evaluation and (3) personal tags. The new metadata are marked with
a particular symbol, so that once the student has inserted the metadata and the thesis
is complete, the tool can identify the semantic markups. When
the paragraph containing the inserted symbol (and hence the concept) is identified,
it is enclosed in a simple tag, with “<concept-name>” at the start and
“</concept-name>” at the end. This annotation scheme allows us to manage the Nomino
concepts as well as to index and extract pertinent paragraphs from the document
according to specific search criteria. During a search session, the system focuses on
the semantic markups, the XML tags, in order to retrieve the pertinent paragraph(s).
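
The annotation scheme can be illustrated with a short Python sketch. Everything specific in it is an assumption made for the example: the marker symbol `@@`, the enclosing `thesis` and `p` elements, and the function names; the paper does not say which symbol or document structure the tool uses.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

MARK = "@@"   # hypothetical symbol typed by the student, e.g. "@@logical-model"
MARKED = re.compile(re.escape(MARK) + r"([A-Za-z][\w-]*)")


def annotate(paragraphs):
    """Enclose each marked paragraph in a tag named after its concept."""
    out = []
    for text in paragraphs:
        match = MARKED.search(text)
        body = escape(text.replace(MARK, "", 1))   # drop the symbol, keep the words
        tag = match.group(1) if match else "p"     # unmarked paragraphs stay plain
        out.append(f"<{tag}>{body}</{tag}>")
    return "<thesis>" + "".join(out) + "</thesis>"


def paragraphs_about(xml_text, concept):
    """Return the text of every paragraph tagged with the given concept."""
    return [e.text for e in ET.fromstring(xml_text).iter(concept)]


doc = annotate(["Intro text.", "We define a @@logical-model of the data."])
```

A search for the concept `logical-model` then returns only the second paragraph, which is the kind of paragraph-level retrieval the scheme is meant to support.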

The following features characterize our annotation tool: 

- The PhD student can use the annotation tool at any time during the writing of
the thesis.

- Nomino reads the student's selection (parts of the thesis) and proposes
concepts to the student. The student can accept, reject or complete Nomino's
proposal.

- The concepts proposed to the user can be: concepts extracted from the document 
itself (by using Nomino) or concepts usually used in all the theses (e.g. “model”, 
“architecture”). 

- The validated concepts are integrated into the thesis as metadata tags. These new 
tags will be exploited during a search session. 

Figure 1 presents the general structure of our annotation tool.

Fig. 1. General structure of the annotation tool

We have carried out another study in order to improve the annotation tool. In the next
section (Section 4), we describe this work, which concerns the construction of an
ontology able to expand the requests and categorize the documents.



4 Methodology Used to Construct the Domain Ontology

Gruber has defined an ontology as “an explicit specification of a conceptualization”.
A conceptualization is defined by concepts and other entities that are presumed to
exist in some area of interest and the relationships that hold among them [11]. In the
artificial intelligence community, an ontology means the construction of
knowledge models [11], [13], [18], [20] which specify concepts, their attributes and
inter-relationships. A knowledge model is a specification of a domain that focuses on
the concepts, relationships and reasoning steps characterizing the phenomenon under
investigation.

Our ontology is composed of two elements: the “domain concepts” and the
“relationships” among them. The “domain concepts” are words or groups of words
that are used to characterize a specific field. The “relationships” among these
concepts are of associative and hierarchic type.
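
One simple way to hold these two elements in memory, and to use them for expanding a user request, is sketched below in Python. The class layout and the expansion rule (a concept plus its direct narrower and associated concepts) are assumptions for illustration, and the relationships between the sample concepts are invented; they are not taken from the ontology built by the authors.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    broader: dict = field(default_factory=dict)   # concept -> parent (hierarchic)
    related: dict = field(default_factory=dict)   # concept -> associated concepts

    def add_hierarchic(self, child, parent):
        self.broader[child] = parent

    def add_associative(self, a, b):
        # Associative relationships are stored in both directions.
        self.related.setdefault(a, set()).add(b)
        self.related.setdefault(b, set()).add(a)

    def narrower(self, concept):
        return sorted(c for c, p in self.broader.items() if p == concept)

    def expand(self, concept):
        """Return the concept with its direct narrower and associated concepts."""
        terms = {concept, *self.narrower(concept),
                 *self.related.get(concept, ())}
        return sorted(terms)


onto = Ontology()
onto.add_hierarchic("logical model", "model")                      # invented link
onto.add_associative("information system", "information research")  # invented link
```

A request for “model” would then also match paragraphs tagged “logical model”, while a concept without relationships is left unchanged.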

Two main approaches can be chosen when building an ontology. The first one
relies on a “top-down method”: an existing ontology is specialized or
generalized to create another one. The second way to build an ontology is to use a
“bottom-up method”, which consists in extracting from the appropriate
documents all the concepts and the relations among concepts to compose an ontology.
We believe that this last method suits our case because no ontology tied to our
domain exists yet. This method relies on two main steps: (1) the extraction
of domain concepts (Section 4.1.1) and (2) the identification of relationships among
these domain concepts (Section 4.1.2).

Various methodologies exist to guide the theoretical approach chosen, and 
numerous tools for building an ontology are available. The problem is that these 
procedures have not merged into popular development styles or protocols, and tools 
have not yet matured to the degree one expects in other software practices. Examples 
of methodologies followed to build an ontology are described in [2], [12], [21]. In 
general, the following steps can define the methodology used to build our ontology: 
(1) the “ontology capture” step and (2) the “ontology coding” step. The “ontology 
capture” step consists in the identification of concepts and relationships. The 
"ontology coding" step consists in the definition of concepts and relationships in a 
formal language. These two steps will be described in the following paragraphs in 
order to present the construction of our ontology. 

4.1 The Ontology Capture Step

The ontology capture step consists in designing the overall conceptual structure of the 
domain. This will likely involve identifying the main concrete concepts of the domain 
(Section 4.1.1) and their properties, and identifying relationships among concepts 
(Section 4.1.2). 

4.1.1 Concept Extraction 

This section reports on our methodology used to define concepts that describe the 
content of theses. The backbone of our ontology is a hierarchy of concepts that have 
been extracted from the theses themselves. 

Concepts of the ontology are used to automatically categorize documents and so allow
thematic access to them. The difficulty of retrieving concepts and their
structure lies in the tools used to retrieve them. As described in
Section 3.1, we have used Nomino to extract our concepts. Given a document or a
group of documents, Nomino constructs a specific index, which contains phrases
composed of two or more words that are supposed to define the field. These concepts
are called CNU (Complex Noun Units); they are series of structured terms composed
of nominal groups or prepositional groups [8]. We used these CNU as a starting point
to construct our ontology.

The use of NLP tools, like Nomino, often produces “errors” that have to be
corrected by a specialist of the domain. These “errors” include phrases that
are not really concepts and phrases that do not really describe the document; they
are further described in [3]. The “errors” found in our work with
Nomino were generally of the following kinds: (1) frequently used verbs (e.g. “called”),
(2) abbreviations of names (e.g. “J.L. Morrison”), (3) names of people, cities, etc.
(e.g. “France community”), and (4) phrases composed of CNU concepts which
describe the current position in the document (“next chapter”, “next phase of the
development”). The corpus used to construct the ontology was composed of scientific
documents. So far we have worked only with a sample of six theses
from the computer science field, representing different applications of
that field. We limited ourselves to these six theses because we
noticed that, for theses in the computer science field, the number of concepts grows
very little when new theses are added. For example, Table 1 shows that
there is a tendency to use the same concepts: when we reach 12 theses,
no new concepts appear.

In Table 1, however, the number of concepts increases only in the row for 7 theses.
This is because our work focuses on the concepts that appear in almost all the
theses. Also, Nomino does not consider the words that are used all the time in a
document: a word that is used very often is probably a “common” word and not a
concept. Nevertheless, by using Nomino we obtained 1,007 concepts, such as
“information research”, “information system”, “research system”, “remote training”,
“abstract ontology” and “logical model”. However, some concepts are not pertinent and
a human expert must evaluate them. After the evaluation we selected only 688 concepts.

The next step in constructing the ontology is to define the relationships among
concepts. In the next section, we describe the process used to find relationships by
using Nomino’s concepts as input. These concepts are proposed to the PhD student
while the thesis is being written.
Table 1. Comparison of the concepts extracted for the set of theses when adding a new thesis to
the corpus. The numbers of the first column are the number of theses in which the concepts 
agree 
4.1.2 Identification of Semantic Relationships

There are several approaches for acquiring semantic relationships. Once concepts
have been retrieved with Nomino, they must be structured. One of the most widely
used techniques to discover relationships among concepts relies on the number of
co-occurrences: it identifies concepts that often occur together in documents.
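
The co-occurrence idea can be illustrated with a minimal sketch. It is not the procedure implemented by Nomino or LIKES; the function name `cooccurrence_counts`, the substring matching and the toy documents are assumptions made only for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, concepts):
    """Count, for each unordered concept pair, the documents containing both.

    Hypothetical helper: concepts are matched as lower-cased substrings.
    """
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Sort so that each pair is always stored in the same order.
        present = sorted(c for c in concepts if c in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Toy corpus (invented sentences built around concepts quoted in the text).
docs = [
    "An information system supports information research.",
    "The logical model of the information system is described.",
    "Remote training relies on a logical model.",
]
concepts = ["information system", "information research", "logical model"]
counts = cooccurrence_counts(docs, concepts)
```

Pairs with a high count are candidates for a relationship; as in the rest of this section, a human expert would still have to validate them.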

Other techniques identify relationships among concepts from the contexts in which
the concepts occur. The idea is that two similar concepts do not necessarily occur
together, but can occur in similar contexts.

A first method based on this principle is described in [7]. This method defines a
context and determines which concepts or terms, from a predefined list, often occur
in similar contexts. The terms from the predefined list are called “target words” and
the ones appearing in the same context are known as “context words”. Each “context
word” is weighted according to its dependency on the given target and on the other
context words.

A second method relying on this idea of similarity between contexts of concepts’ 
occurrences is described in [9]. This method, which uses syntactical information, 
considers the concepts themselves, mainly noun phrases and linking words composing 
these concepts, based on the similarity of their syntactical contexts. 

In our approach, we used an NLP tool able to extract relationships among concepts:
LIKES (Linguistic and Knowledge Engineering Station) [17]. LIKES, based on
statistical principles, is a computational linguistics workstation with functions
for building terminologies and ontologies.

The concepts extracted by Nomino were paired up in order to find relationships
between them; we paired all the concepts manually. Each pair was also evaluated in
the opposite order. For example, the pair “knowledge / language” was also evaluated
as “language / knowledge”. Identifying relationships with LIKES is time-consuming,
both to process the corpus and to visualize the possible relationships.
Furthermore, the relationships found have to be evaluated by a human expert.
LIKES allows the representation of relationships in order to find similar
relationships in other pairs of concepts. Examples of phrases that contained some
relationship between the concept pair “knowledge / language” are the following (we
have kept the same sentence structure in English as in French):

- Knowledge is represented by all the Languages;

- Knowledge is represented by some Languages;

- Knowledge is represented in all the Languages.

In this example the relation “is represented” stands out.

The ontology-coding step, described next, explains how the relationships identified
by LIKES are used to model a formal ontology.

4.2 The Ontology-Coding Step 

Uschold [22] defines “coding” as the explicit representation of the captured 
conceptualization in a formal language. 

In order to represent concepts and their relationships we chose the tool
Protégé-2000, a knowledge-engineering tool that enables developers to create
ontologies and knowledge bases [5], [10], [14], [15].

We used Protégé-2000 to model the concepts extracted by Nomino and the
relationships among concepts identified by LIKES. We now have an ontology of the
computer science field composed of 3194 concepts.



5 Effective Retrieval Information 

Most search engines use simple keywords connected by logical operators to make
queries. In our context, we have developed a domain ontology and inserted new
metadata into the digital theses. We are currently implementing a prototype able to
improve the query process. This prototype will provide several improvements in
information retrieval:

- Terms used in the user’s request will be transformed into concepts; 

- It will be easier to navigate between concepts; 

- The information obtained will match the users’ needs more precisely.

The terms used in the user’s request will be transformed by means of the ontology.
Often, when a user formulates a query, the terms are ambiguous and frequently
different from those used in the documents. We are therefore designing an intelligent
system able to expand the users’ requests with the concepts contained in the
documents. The intelligent system will propose new concepts or synonyms to expand
the user’s query. These candidate concepts will first be selected from the ontology.
To expand the user’s request, new concepts and logical operators are being studied.
Moreover, a thesaurus will be used to complete the ontology. The user will accept or
reject the system’s suggestions.

Once the request is defined, the system will find pertinent information, included in 
the theses, by using semantic metadata. Then the paragraphs containing the pertinent 
information will be displayed. Users will be able to select the most pertinent 
paragraph(s) before reading the complete document(s). 
The following features characterize our proposal for intelligent access (Fig. 2):

- User input will be entered via an HTML page.

- Parsing will be used to determine the sentence composition of the user’s request. 

- The ontology will allow the selection of new concepts to expand the request. 

- The new query will be sent to the database in order to retrieve fragments or 
complete theses. In this step, the query will be based on the semantic metadata 
contained in the theses (see Fig. 1). 

- The end user will get the results via the HTML page. 

- The user will be able to retrieve the entire thesis if satisfied with the
results.




Fig. 2. General structure of the proposed model to expand the user query requests

In general, given a search term, the ontology will propose closer terms to 
significantly enrich the request of the user. 
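
As a rough illustration of this expansion step, the sketch below looks up each query term in a toy ontology and proposes its related concepts. The prototype described here is still under development, so the dictionary-based ontology, the function `expand_query` and the entry “distance learning” are hypothetical and serve only to make the idea concrete.

```python
def expand_query(terms, ontology):
    """Return the original terms followed by the concepts the ontology relates to them.

    `ontology` is assumed to be a plain dict mapping a concept to a list of
    related concepts; a real ontology would carry typed relationships.
    """
    expanded = []
    for term in terms:
        for candidate in [term] + ontology.get(term, []):
            if candidate not in expanded:  # keep order, avoid duplicates
                expanded.append(candidate)
    return expanded


# Toy ontology: the first entry reuses concepts quoted in the paper,
# the second entry is invented for the example.
toy_ontology = {
    "information system": ["information research", "research system"],
    "remote training": ["distance learning"],
}
suggestions = expand_query(["information system", "ontology"], toy_ontology)
```

In the envisaged system the user would then accept or reject each suggested concept before the expanded query is sent to the database.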



6 Conclusion and Further Research 

We have presented an approach to improve document retrieval by using NLP tools 
based on the semantic content of digital theses. Our approach has a double advantage: 
first, it can entirely exploit the content of digital theses by using semantic annotations 
and second, it can provide new alternatives to the users’ requests. 

Ontologies can be used to support the growth of a new kind of digital library,
implemented as a distributed intelligent system [23]. Consequently, an ontology can
be used to deduce characteristics of the content being searched, and to identify
appropriate and available operations in order to access or manipulate content in
other ways.

We have constructed an ontology by following a logical methodology. As long as 
there are no tools able to automatically construct ontologies from documents, the 
process carried out by using NLP tools will continue to require the help of human 
experts. The extraction of relationships by hand is very complex and with the use of 
NLP tools there are still some concepts whose relationships have to be instantiated by 
the human expert. 
It is evident that there are still some needs in the ontology construction domain.
The construction of our ontology is only the first step towards making available the 
pertinent information in the digital library. 

Currently, we are developing a prototype to access the pertinent information. To
retrieve this information, the prototype will rely on the new semantic metadata
tags, which are added by the PhD student while writing the thesis. Retrieval will
also be based on the ontology concepts, in order to expand query requests during a
search session. Our prototype will propose different keywords to the user, so that
requests can be completed while searching for theses.

Further research should investigate the use of dictionaries or thesauri in digital
libraries to detect similar but non-identical concepts. The use of synonyms to
complete our ontology is another direction to explore.
Abstract. The lack of large, semantically annotated corpora is one of
the main drawbacks of Word Sense Disambiguation systems. Unsupervised
systems do not need such corpora and rely on the information of
the WordNet ontology. In order to improve their performance, the use
of other lexical resources needs to be investigated. This paper describes
the effort to integrate the Conceptual Density approach with sources
of lexical information different from WordNet, particularly the WordNet
Domains and the Cambridge Advanced Learner’s Dictionary. Unfortunately,
enriching WordNet glosses with samples of another lexical
resource did not provide the expected results.



1 Introduction 

The lack of large, semantically annotated corpora is one of the main drawbacks
of supervised Word Sense Disambiguation (WSD) approaches. Our unsupervised
approach does not need such corpora: it relies only on the WordNet (WN) lexical
resource, and it is based on Conceptual Density and the frequency of WordNet
senses [7]. Conceptual Density (CD) is a measure of the correlation among the
sense of a given word and its context. The foundation of this measure is the
Conceptual Distance, defined as the length of the shortest path which connects
two concepts in a hierarchical semantic net.

Our approach gave good results, in terms of precision, for the disambiguation
of nouns over SemCor (81.55% with a context window of only two nouns,
compared with the MFU-baseline of 75.55%), and in the recent all-words task
in the Senseval-3 (73.40%, compared with the MFU-baseline of 69.08%) [2].
Unfortunately, although the precision achieved by our system is above that of the
baseline, we still need to improve the recall, since there are nouns whose senses
are close in meaning that are left undisambiguated by our system. We investigated
the use of other lexical resources, the WordNet Domains [5] and the
Cambridge Advanced Learner’s Dictionary (CALD), to improve our approach.



2 Combining Conceptual Density and Frequency 



In our approach the noun sense disambiguation is carried out by means of the 
formula presented in [7]. This formula has been derived from the original Con- 
ceptual Density formula described in [1]: 



$$ CD(c, m) = \frac{\sum_{i=0}^{m-1} nhyp^{i}}{\sum_{i=0}^{h-1} nhyp^{i}} \qquad (1) $$



where c is the synset at the top of subhierarchy, m the number of word senses 
falling within a subhierarchy, h the height of the subhierarchy, and nhyp the 
averaged number of hyponyms for each node (synset) in the subhierarchy. The 
numerator expresses the expected area for a subhierarchy containing m marks 
(word senses), whereas the divisor is the actual area. 
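
A minimal sketch of this ratio follows, assuming that both areas are geometric sums of powers of nhyp (up to m - 1 for the expected area and up to h - 1 for the actual one). The function name and the toy values are illustrative only.

```python
def conceptual_density(m, h, nhyp):
    """Ratio between the expected area for m marks and the actual area.

    Assumption: both areas are sums of nhyp**i, for i < m and i < h respectively.
    """
    expected_area = sum(nhyp ** i for i in range(m))
    actual_area = sum(nhyp ** i for i in range(h))
    return expected_area / actual_area


# Toy values: a subhierarchy of height 3 with 2 hyponyms per node on average.
sparse = conceptual_density(m=1, h=3, nhyp=2.0)  # 1 / (1 + 2 + 4)
dense = conceptual_density(m=3, h=3, nhyp=2.0)   # (1 + 2 + 4) / (1 + 2 + 4)
```

Under this reading, a subhierarchy containing more marks for the same height obtains a higher density.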

Because the average number of hyponyms for each node in WN2.0 (the version we
used) is greater than in WN1.4 (the version originally used by Agirre and Rigau),
we decided to consider only the relevant part of the subhierarchy, determined by
the synset paths (from c to an ending node) of the senses of both the word to be
disambiguated and its context, and not the portion of the subhierarchy constituted
by the synsets that do not belong to the synset paths. The base formula takes into
account the number M of relevant synsets, corresponding to the marks m in
Formula 1 (|M| = |m|, even if we determine the subhierarchies before adding such
marks instead of vice versa as in [1]), divided by the total number nh of synsets
of the subhierarchy.



$$ baseCD(M, nh) = M / nh \qquad (2) $$



The original formula and the above one do not take into account sense frequency.
It is possible that both formulas select subhierarchies with a low-frequency
related sense. In some cases this would be the wrong choice. This pushed us to
modify the CD formula by also including the information about frequency
contained in WN:

$$ CD(M, nh, f) = M^{\alpha} (M/nh)^{\log f} \qquad (3) $$

where M is the number of relevant synsets, α is a constant (the best results
were obtained over the SemCor corpus with α near to 0.10), and f is an integer
representing the frequency of the subhierarchy-related sense in WN (1 means the
most frequent, 2 the second most frequent, etc.). This means that the first sense
of the word (i.e., the most frequent) gets at least a density of 1, and one of the
less frequent senses will be chosen only if it exceeds the density of the first
sense. The M^α factor was introduced to give more weight to those subhierarchies
with a greater number of relevant synsets, when the same density is obtained
among many subhierarchies.




Fig. 1. Subhierarchies resulting from the disambiguation of brake with the context
words {horn, man, second}. Example extracted from the Senseval-3 English all-words
test corpus (M_1 = 9, nh_1 = 21, M_2 = M_3 = nh_2 = nh_3 = 1, M_4 = 1, nh_4 = 5, where
M_i and nh_i indicate, respectively, the M and nh values for the i-th sense)

Figure 1 shows the WordNet subhierarchies resulting from the disambiguation
of brake with the context words {horn, man, second} from the
sentence: “Brakes howled and a horn blared furiously, but the man would have
been hit if Phil hadn’t called out to him a second before”, extracted from the
all-words test corpus of Senseval-3. The areas of the subhierarchies are drawn with
a dashed background, the roots of the subhierarchies are the darker nodes, while the
nodes corresponding to the synsets of the word to disambiguate and those of
the context words are drawn with a thicker border. Four subhierarchies have
been identified, one for each sense of brake. The senses of the context words
falling outside of these subhierarchies are not taken into account. The resulting
CDs are, for each subhierarchy, respectively: 9^{0.10} * (9/21)^{log 1} = 1.27, 1, 1 and
1^{0.10} * (1/5)^{log 4} = 0.07; therefore, the first one is selected and the first sense is
assigned to brake.
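
The selection step of this example can be sketched as follows. The M and nh values are those given in the caption of Fig. 1; the exact value of α (only said to be near 0.10) and the base of the logarithm are not stated, so the sketch assumes α = 0.10 and the natural logarithm, and its densities need not coincide to the last digit with the figures reported above. The function names are illustrative.

```python
import math


def base_cd(M, nh):
    """Fraction of relevant synsets in the subhierarchy (base formula)."""
    return M / nh


def cd(M, nh, f, alpha=0.10):
    """Frequency-aware density; natural log and alpha = 0.10 are assumptions."""
    return (M ** alpha) * base_cd(M, nh) ** math.log(f)


# (M, nh) for the four senses of "brake", as listed in the caption of Fig. 1.
subhierarchies = [(9, 21), (1, 1), (1, 1), (1, 5)]
densities = [cd(M, nh, f) for f, (M, nh) in enumerate(subhierarchies, start=1)]

# 1-based index of the densest subhierarchy (ties go to the more frequent sense).
best_sense = max(range(len(densities)), key=densities.__getitem__) + 1
```

Whatever the base of the logarithm, the first sense obtains M^α > 1 because log 1 = 0, senses 2 and 3 obtain exactly 1, and the fourth sense falls below 1, so the first sense is selected.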

Nouns are left undisambiguated when different senses are close in meaning,
like senses 2 and 3 of brake in the previous example, and no context senses
fall into the subhierarchies. In this case, the maximum density is the same (1)
for several senses; therefore, we cannot assign a sense to the word. Initially, we
investigated the option of assigning the most frequent of those senses, but
it gave no significant advantage with respect to always selecting the first sense.
Consequently, we decided to integrate our approach with other resources, in
order to retrieve more useful information for the disambiguation of words.



3 Experiments with WordNet Domains 

A useful piece of information to consider in the disambiguation process is the
domain of words: e.g. one sense of a word is more likely to be found with a given
context word than another sense of the same word when both that sense and the
context word concern the same domain, such as “economy”. We observed that in
version 2.0 of WordNet three “domain” relationships (category, region and usage)
have been introduced. Unfortunately, they are defined only for a small number of
synsets: 3643 (4.5% of the total number of noun synsets), 653 (0.8%)
and 1146 (1.4%), one figure per relationship. Because the WordNet Domains resource
provides a wider coverage of synsets, we carried out some experiments to see if we
could exploit both the new relationships and the WordNet Domains.

We performed some tests over the SemCor corpus, disambiguating all the
words by assigning them the sense corresponding to the synset whose domain
is matched by the majority of the context words’ domains. We tried different
sizes of the context window. The results are summarized in Figure 2.
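
A minimal sketch of this domain-voting test is given below. It simplifies the setting by assuming one domain label per sense and one per context word, by breaking ties in favour of the lower-numbered (more frequent) sense, and by leaving the word undisambiguated when no domain matches; these choices, the function name and the domain labels are assumptions.

```python
from collections import Counter


def disambiguate_by_domain(sense_domains, context_domains):
    """Return the 1-based sense whose domain is most frequent in the context.

    Returns None when no sense domain appears among the context domains,
    i.e. the word is left undisambiguated.
    """
    votes = Counter(context_domains)
    best_sense, best_votes = None, 0
    for sense, domain in enumerate(sense_domains, start=1):
        if votes[domain] > best_votes:  # strict '>' keeps the more frequent sense on ties
            best_sense, best_votes = sense, votes[domain]
    return best_sense


# Invented domain labels for a two-sense word and a three-word context window.
chosen = disambiguate_by_domain(["Economy", "Geography"], ["Economy", "Economy", "Sport"])
unresolved = disambiguate_by_domain(["Economy", "Geography"], ["Sport", "Music"])
```

The second call shows how words are left without a sense when the context shares no domain with them, which is one way a low recall can arise.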

The very low recall, even with large context windows, and the lower precision
obtained by using the domain relationships in WN2.0 with respect to the
WordNet Domains resource led us to rely only on the latter for further
experiments. Since WordNet Domains was developed on version 1.6
of WordNet, it was necessary to map the synsets from the older version
to the latest one. This was done automatically, by using
the WordNet mappings for nouns and verbs, and by checking the similarity of
synset terms and glosses for adjectives and adverbs. Some domains were also
assigned by hand when necessary.

Thereafter, additional weights (Mutual Domain Weights, MDWs) were
added to the densities of the subhierarchies corresponding to those senses having
the same domain as the context nouns’ senses. Each weight is proportional to the
frequency of such senses, and is calculated in the following way:

$$ MDW(w_f, c_{ij}) = \begin{cases} 0 & \text{if } Dom(w_f) \neq Dom(c_{ij}) \\ 1/f \cdot 1/j & \text{if } Dom(w_f) = Dom(c_{ij}) \end{cases} \qquad (4) $$







Fig. 2. Comparison between WN2.0 domain relationships (2.0 columns) and WordNet
Domains (wnd columns) with different window sizes (2, 4, 6, 12 and 24) over polysemic
nouns in the SemCor corpus. Precision and recall are given as percentages



where f is an integer representing the frequency of the sense of the word to be
disambiguated, j gives the same information for the i-th context word, Dom(x) :
Synsets → Domains is the function returning the domain(s) corresponding to
synset x, and w_f and c_ij are, respectively, the synsets corresponding to the f-th
sense of the word to be disambiguated and the j-th sense of the i-th context word.

E.g. suppose that, for the word to be disambiguated (w), we obtain Dom(w_1) =
“Medicine” and Dom(w_4) = “School”. If the context word (c_1) has a third sense
for which Dom(c_13) = “School”, the resulting weight for w_4 and c_13 is
1/4 * 1/3. After the inclusion of MDWs, formula (3) becomes:

$$ CD(M, nh, w, f, C) = M^{\alpha} (M/nh)^{\log f} + \sum_{i=0}^{|C|} \sum_{j=1}^{k} MDW(w_f, c_{ij}) \qquad (5) $$

where C is the vector of context words, k is the number of senses of the context
word c_i, and c_ij is the synset corresponding to the j-th sense of the context word
c_i.
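
The weighting scheme can be sketched as below. For simplicity the sketch assumes a single domain label per synset (whereas Dom may return several), α = 0.10 and the natural logarithm; the function names and the domain labels of the toy context are illustrative.

```python
import math


def mdw(f, j, dom_w, dom_c):
    """Domain weight between the f-th word sense and the j-th context-word sense."""
    return (1 / f) * (1 / j) if dom_w == dom_c else 0.0


def cd_with_domains(M, nh, f, dom_w, context_domains, alpha=0.10):
    """Density of a sense plus the sum of its domain weights.

    `context_domains` holds, for each context word, the list of domains of its
    senses in frequency order (index 0 is the most frequent sense).
    """
    density = (M ** alpha) * (M / nh) ** math.log(f)
    weights = sum(
        mdw(f, j, dom_w, dom_c)
        for senses in context_domains
        for j, dom_c in enumerate(senses, start=1)
    )
    return density + weights


# A fourth sense in the "School" domain matched by the third sense of a context word.
weight = mdw(4, 3, "School", "School")  # 1/4 * 1/3
total = cd_with_domains(1, 1, 4, "School", [["Economy", "Law", "School"]])
```

With M = nh = 1 the density term equals 1, so `total` is simply 1 plus the single matching weight.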

With the introduction of MDWs, however, we did not obtain the desired
improvements (70.79% in precision and 67.91% for recall, below the MFU baseline
of 75.5% for both measures). The reason is that many of the correspondences
in domains are found for the Factotum domain, which is too generic and,
consequently, does not provide any useful information about the correlation of
two word senses. Our solution was to reduce the relevance of the Factotum
domain by a factor of 10, with formula (4) modified as follows:
$$ MDW(w_f, c_{ij}) = \begin{cases} 0 & \text{if } Dom(w_f) \neq Dom(c_{ij}) \\ 1/f \cdot 1/j & \text{if } Dom(w_f) = Dom(c_{ij}) \wedge Dom(w_f) \neq \text{“Factotum”} \\ 10^{-1} \cdot (1/f \cdot 1/j) & \text{if } Dom(w_f) = Dom(c_{ij}) \wedge Dom(w_f) = \text{“Factotum”} \end{cases} \qquad (6) $$

In this way we obtained a precision of 78.33% and a recall of 62.60%
over the whole SemCor, with a context window of 4 nouns. We also tried not
taking the Factotum domain into account. In this case we got a higher precision
(80.70%), but the recall was only 59.08%. This means that, whereas the Factotum
domain does not provide useful information for the disambiguation task, it can
help in disambiguating a certain number of nouns with the most frequent sense,
thanks to the weights assigned proportionally to the frequency of senses.
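
The reduced weighting of the generic domain can be sketched as follows, again assuming a single domain label per synset. The function name and the `factotum_factor` parameter are illustrative; setting the factor to 0 corresponds to ignoring the Factotum domain altogether.

```python
def mdw_factotum(f, j, dom_w, dom_c, factotum_factor=0.1):
    """Domain weight with the generic "Factotum" domain scaled down."""
    if dom_w != dom_c:
        return 0.0
    weight = (1 / f) * (1 / j)
    return factotum_factor * weight if dom_w == "Factotum" else weight


specific = mdw_factotum(1, 6, "Economy", "Economy")   # full weight: 1 * 1/6
generic = mdw_factotum(2, 4, "Factotum", "Factotum")  # reduced: 0.1 * (1/2 * 1/4)
ignored = mdw_factotum(2, 4, "Factotum", "Factotum", factotum_factor=0.0)
```

Keeping a small non-zero factor lets frequent senses still collect some weight, which matches the trade-off between precision and recall reported above.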

We used nearly the same method to disambiguate words of POS categories
other than nouns. In these cases we could not take into account the Conceptual
Density, for the following reasons: first of all, in WordNet there is no hierarchy
for adjectives and adverbs. With regard to verbs, the hierarchy is too shallow to
be used efficiently. Moreover, since the disambiguation is performed one sentence
at a time, in most cases only one verb can be found in each sentence (with the
consequence that no density can be computed).

The sense disambiguation of an adjective is performed only on the basis of
the domain weights and the context, constituted by the Closest Noun (CN), i.e.,
the noun the adjective is referring to. Given one of its senses, we extract the
synsets obtained through the WordNet relationships of that sense (such as
pertainym, similar-to and antonym).
For each of them, we calculate the MDW with respect to the senses of the context
noun. The weight assigned to the adjective sense is the average of these
MDWs. The selected sense is the one having the maximum average weight.

In order to achieve the maximum coverage, the Factotum domain has also been
taken into account to calculate the MDWs between adjective senses and
context noun senses. However, since in many cases this domain
does not provide useful information, the weights resulting from a Factotum
domain are reduced by a factor of 0.1. E.g. suppose we disambiguate an adjective
referring to a noun, where sense 1 of the adjective and sense 6 of the noun belong
to the same domain. Furthermore, the Factotum domain contains senses 1, 4
and 7 of the noun, and senses 2 and 3 of the adjective. The extra synsets obtained
by means of the WN relationships are: a pertainym of sense 1; a similar synset and
an antonym of sense 2; a similar synset and an antonym of sense 3. Since there are
no senses of the noun in the domain of the pertainym, the pertainym is not taken
into account. Therefore, the resulting weights for the adjective are:

1 * 1/6 ≈ 0.17 for sense 1;

0.1 * (1/2 + 1/2 * 1/4 + 1/2 * 1/7 + [1/2 * 1/3 + 1/2 * 1/2])/5 ≈ 0.02 for sense 2;
0.1 * (1/3 + 1/3 * 1/4 + 1/3 * 1/7 + [1/3 * 1 + 1/3 * 1])/5 ≈ 0.02 for sense 3.
The weights resulting from the extra synsets are represented within square
brackets. Since the maximum weight is obtained for the first sense, this is the
sense assigned to the adjective.
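
The arithmetic of this example can be checked with a few lines of code. The expressions are copied from the three weights listed above, with the square-bracketed contributions of the extra synsets kept as parenthesised sub-expressions; the variable names are illustrative.

```python
FACTOTUM_FACTOR = 0.1  # reduction applied to weights coming from the generic domain

# Sense 1: a single match with the sixth sense of the noun in a specific domain.
sense_1 = 1 * 1 / 6

# Senses 2 and 3: generic-domain matches, averaged over five MDWs.
sense_2 = FACTOTUM_FACTOR * (1/2 + 1/2 * 1/4 + 1/2 * 1/7 + (1/2 * 1/3 + 1/2 * 1/2)) / 5
sense_3 = FACTOTUM_FACTOR * (1/3 + 1/3 * 1/4 + 1/3 * 1/7 + (1/3 * 1 + 1/3 * 1)) / 5

weights = {1: sense_1, 2: sense_2, 3: sense_3}
selected = max(weights, key=weights.get)  # sense with the maximum weight
```

Both reduced weights round to 0.02, well below the weight of the first sense, which is therefore the one selected.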

We tried to use the same idea to improve the results for the noun sense
disambiguation task: if, in a sentence, an adjective has a closer link to the noun
it refers to than to the other words, it is also true that the noun
has a close link to that adjective. This kind of link is also easy to find, since in
English adjectives normally come before the noun they refer to. Therefore,
we investigated the utility of choosing only the Closest Adjective (CA) in the
context for calculating MDWs for nouns, in a similar way to what we did for
adjectives. Our experiments show that the precision and recall values differ
only slightly from the base results obtained without domains.



Table 1. Comparison among approaches to noun sense disambiguation making use of 
WordNet Domains. The results are obtained over the whole SemCor with a context 
window size of 4 words (in all cases but the CA and the MFU ones) 





            MFU      no WND   WND      WND (CA)
Precision   75.55%   80.70%   78.33%   80.45%
Recall      75.55%   59.07%   62.60%   59.42%
Coverage    100%     73.20%   79.91%   73.86%



The same experiments carried out over the Senseval-3 corpus showed a more
significant difference between the CA technique and the other ones. Moreover,
in this case the precision is even higher than the one obtained without taking
into account the WordNet Domains.



Table 2. Comparison among approaches to noun sense disambiguation making use
of WordNet Domains. The results are obtained over the Senseval-3 All-Words Task
corpus with a context window size of 4 words (in all cases but the CA and the MFU
ones)

            MFU      no WND   WND      WND (CA)
Precision   69.08%   73.40%   65.39%   74.30%
Recall      69.08%   51.81%   58.28%   52.69%
Coverage    100%     70.58%   89.13%   70.91%



We tried to limit the search for the closest adjective to the word immediately
preceding the noun or to the two preceding words, but the results differ
by only 0.1-0.2% (Tables 3 and 4) from those obtained without such a
restriction.

The results are very similar, since the approaches differ for only a few hundred
nouns, as can be observed from the coverage values (the corpus is made up of
more than 70000 nouns).
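
The three measures reported in the tables are linked. Assuming the usual definitions (precision = correct / attempted, recall = correct / total, coverage = attempted / total), which the text does not spell out, recall equals precision times coverage. The sketch below checks that the SemCor figures of Table 1 are consistent with this relation; the function name is illustrative.

```python
def recall_from(precision, coverage):
    """Recall implied by precision and coverage under the usual definitions."""
    return precision * coverage


# (precision, coverage, reported recall) taken from Table 1 (SemCor).
rows = {
    "no WND": (0.8070, 0.7320, 0.5907),
    "WND": (0.7833, 0.7991, 0.6260),
    "WND (CA)": (0.8045, 0.7386, 0.5942),
}

# True when the reported recall matches precision * coverage up to rounding.
consistent = {
    name: abs(recall_from(p, c) - r) < 0.0005
    for name, (p, c, r) in rows.items()
}
```

This also explains why small differences in coverage, of a few hundred nouns, translate into only marginal differences in recall.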
Table 3. Comparison among approaches to noun sense disambiguation searching
backwards in context for the closest adjective without restrictions, only within 2 words
before the noun, and just the word before. The results are obtained over the whole
SemCor

            unrestricted   2 words   1 word
Precision   80.45%         80.55%    80.57%
Recall      59.42%         59.32%    59.27%
Coverage    73.86%         73.64%    73.56%



Table 4. Comparison among approaches to noun sense disambiguation searching
backwards in context for the closest adjective without restrictions, only within 2 words
before the noun, and only the word before. The results are obtained over the whole
Senseval-3 All-Words Task corpus

            unrestricted   2 words   1 word
Precision   74.30%         74.41%    74.49%
Recall      52.69%         52.57%    52.68%
Coverage    70.91%         70.58%    70.80%



The results obtained over the Senseval-3 All-Words Task corpus confirm that
the precision of the CA approach can be slightly improved by considering
adjectives with a stronger tie (i.e., closer) to the noun to disambiguate.

Another approach we evaluated was to add weights to subhierarchies only
when nouns were left undisambiguated by the use of the “clean” CD formula
(3); in other words, to include the sum in formula (5) only when CD and
frequency are not able to disambiguate the noun. Unfortunately, the results are
approximately the same as those in Tables 3 and 4 and, therefore, they are not
reported here.

The sense disambiguation of a verb is done in nearly the same way as for
adjectives, but taking into consideration only the MDWs between the verb’s senses
and the context words (i.e., in the example above, if we had to disambiguate
a verb instead of an adjective, the weights within the square brackets would
not have been considered). In the Senseval-3 all-words and gloss disambiguation
tasks the two context words were the nouns before and after the verb, whereas
in the lexical sample task the context words were four (two before and two after
the verb), without regard to their POS category. This was done in order
to improve the recall in the latter task, whose test corpus is made up mostly of
verbs, since our experiments carried out over the SemCor corpus showed that
considering only the nouns preceding and following the verb achieves
a better precision, while the recall is higher when the 4-word context is used.
However, our results over verbs are still far from the most-frequent baseline. The
sense disambiguation of adverbs (in every task) is carried out in the same way
as the disambiguation of verbs for the lexical sample task.
4 Experiments with Glosses

Glosses have been used in the past as a resource for Word Sense Disambiguation
by Lesk [4] and many other researchers. Usually WordNet glosses are composed
of two parts:

- a definition part;

- a sample part.

We carried out some experiments over the WordNet glosses in order to
understand which of these portions is more important to the task of Word Sense
Disambiguation. Initially, we defined Gloss Weights (GWs) similarly to MDWs.
Each GW is calculated as follows:

$$ GW(w_f, c_i) = \begin{cases} 0 & \text{if } c_i \notin Gl(w_f) \\ 0.3 & \text{if } c_i \in Gl(w_f) \end{cases} \qquad (7) $$



where c_i is the i-th word of the context, w_f is the f-th sense of the word to be
disambiguated, and Gl(x) is the function returning the set of words in the
gloss of the synset x, without stopwords.

The GWs are added to formula (3) as an alternative to MDWs:

$$ CD(M, nh, w, f, C) = M^{\alpha} (M/nh)^{\log f} + \sum_{i=0}^{|C|} GW(w_f, c_i) \qquad (8) $$

where C is the vector of context words, and c_i is the i-th word of the context.

We initially used a weight of 0.5, considering it a good match when
two context words were found in a gloss, but subsequently we obtained
better results with a weight of 0.3 (i.e., at least three context words need
to be found in a gloss to obtain a density close to 1). Two other Gloss Weights
were defined, each making use of a different Gl function: GW_d, which uses
Gl_d(x) to return the set of words in the definition part of the gloss of the synset
x; and GW_s, which uses Gl_s(x) to return the set of words in the sample part.
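
A minimal sketch of the gloss weighting follows. The gloss is represented as a plain set of words with stopwords already removed; passing only the definition or only the sample words would give the GW_d and GW_s variants. The function names, α = 0.10 and the natural logarithm are assumptions, and the words in the toy call are invented.

```python
import math


def gloss_weight_sum(context_words, gloss_words, weight=0.3):
    """Sum of Gloss Weights: `weight` for every context word found in the gloss."""
    return sum(weight for word in context_words if word in gloss_words)


def cd_with_gloss(M, nh, f, context_words, gloss_words, alpha=0.10, weight=0.3):
    """Density of a sense plus the Gloss Weights of its context words."""
    density = (M ** alpha) * (M / nh) ** math.log(f)
    return density + gloss_weight_sum(context_words, gloss_words, weight)


# Three of the four context words occur in the (invented) gloss.
bonus = gloss_weight_sum(["car", "stop", "wheel", "horn"], {"car", "stop", "wheel"})
total = cd_with_gloss(1, 1, 1, ["car"], {"car", "stop", "wheel"})
```

With a weight of 0.3, three matching context words contribute about 0.9, which is the “density close to 1” mentioned above.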

The results obtained show that disambiguation carried out by considering
only the sample portion of WordNet glosses is more precise than working on the
whole gloss and/or the definitions. This has also been observed in the
experiments conducted over the Senseval-3 All-Words Task corpus (Table 5). Therefore,
we looked for another machine-readable resource in order to expand the
sample portions of WordNet glosses. We decided to use the Cambridge Advanced
Learner’s Dictionary (CALD), since it is one of the few available on-line, and its
Table 5. Results obtained over the whole SemCor and Senseval-3 All-Words Task
corpora, with a window size of 4 nouns, by using whole glosses or their separate parts,
and Gloss Weights of 0.3

SemCor                   precision (%)   recall (%)
Whole gloss              79.31           60.22
Definition only (GWd)    79.85           59.52
Samples only (GWs)       80.12           59.96

Senseval-3 AWT           precision (%)   recall (%)
Whole gloss              73.75           52.14
Definition only (GWd)    73.98           51.81
Samples only (GWs)       74.06           52.03



HTML pages are organized in a format that makes it easy to retrieve information
about the POS of a word, its definition, and the sample parts of glosses.

The heuristic we use to retrieve the CALD gloss corresponding to a WordNet
one compares the synset terms and the definition parts of the glosses. For each
synset in WordNet, we search the CALD for the synset’s words, and then select the
resulting entries as candidate glosses. If more than 40% of the definition part
of the WordNet gloss is found in one of the CALD definition parts of the candidate
glosses, then the corresponding sample part is added to the WordNet gloss.

E.g. for the WordNet synset [...] ([...]), we search in the CALD web page 
for: [...], obtaining respectively 1, 0 and 1 entries (one is missing since 
CALD returns the same entry for [...] and [...]). A match greater than 40% is 
found only within the CALD definition of [...]: “the quality of cohering or being 
coherent”. Since this definition shares 4 words (out of 7) with the WN gloss 
(the, of, cohering, or), the resulting match is 4/7 = 57%, with the result that 
the following sample sentence is added to the WordNet gloss: “There was no 
coherence between the first and the second half of the film”. A drawback of this 
heuristic is that stopwords are taken into account in the calculation of similarity 
between glosses. This may be useful when they keep the same order and position 
in both definitions, as in the previous example, but in most cases they actually 
add noise to the disambiguation process. Therefore, we will need to determine a 
better way to check the matching of gloss definitions. 
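
The matching heuristic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the overlap is computed as the fraction of WordNet definition tokens that also occur in the CALD definition (the text does not say which definition serves as the denominator), and the WordNet definition string below is an illustrative wording rather than a quotation from the source.

```python
import re

def tokenize(text):
    """Lowercase and keep alphabetic tokens; stopwords are deliberately retained."""
    return re.findall(r"[a-z]+", text.lower())

def definition_overlap(wn_definition, cald_definition):
    """Fraction of WordNet definition tokens that also occur in the CALD definition."""
    wn_tokens = tokenize(wn_definition)
    cald_vocab = set(tokenize(cald_definition))
    if not wn_tokens:
        return 0.0
    shared = sum(1 for token in wn_tokens if token in cald_vocab)
    return shared / len(wn_tokens)

def expand_gloss(wn_definition, wn_samples, cald_entries, threshold=0.4):
    """Append the sample of every CALD entry whose definition matches above the threshold."""
    expanded = list(wn_samples)
    for cald_definition, cald_sample in cald_entries:
        if definition_overlap(wn_definition, cald_definition) > threshold:
            expanded.append(cald_sample)
    return expanded

# Illustrative data: the WordNet wording is an assumption, the CALD strings come from the text.
wn_definition = "the state of cohering or sticking together"
cald_entries = [("the quality of cohering or being coherent",
                 "There was no coherence between the first and the second half of the film")]
ratio = definition_overlap(wn_definition, cald_entries[0][0])  # shared: the, of, cohering, or
expanded = expand_gloss(wn_definition, [], cald_entries)
```

Because stopwords are kept, three of the four shared words in this example are function words, which is exactly the source of noise discussed above. Raising `threshold` to 0.7 rejects this candidate.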

In WordNet 2.0 there are 8195 noun glosses with samples. With our heuristic, 
we found 7416 totally new sample sentences, raising the total number of sample 
sentences from 8195 to 15611. Moreover, new sample sentences were added to 
2483 already existing samples. We used these “expanded” glosses to carry out 
some Word Sense Disambiguation tests over the SemCor and the Senseval-3 
All-Words Task corpora. The results are shown in Table 6. 

The experiments were initially carried out by looking for all the context 
words in the expanded gloss of the word to be disambiguated. Thereafter, due 
to the poor results obtained, we used only those samples which contained its 
context nouns. Finally, we tried to select the gloss in a more precise way by 
using only those samples containing the word to be disambiguated itself together 
with the context nouns. With respect to the Senseval-3 All-Words Task, we 
performed the test only with the context nouns, which gave the best results over 
SemCor. Whereas for the SemCor corpus we obtained results comparable with those 
in Table 5, for the Senseval-3 AWT we observed a precision decrease of about 10%. 

Table 6. Results obtained using the glosses expanded with CALD samples over the 
whole SemCor and Senseval-3 AWT corpora. The window size was 4 words 

                                                precision   recall 
SemCor 
  All context words                               73.84      67.07 
  Only context nouns                              79.78      59.76 
  Context nouns and word to be disambiguated      78.97      59.58 
Senseval-3 AWT 
  Only context nouns                              64.72      48.73 

The reason for these poor results is that many of the added glosses were 
“off-topic”, like: [...], which was added for the synset [...] ([...]). Therefore, 
we tried to improve the quality of the added samples by setting a higher 
threshold (70%) for the matching of the WordNet and CALD definitions, and by 
selecting only those glosses sharing at least one (non-stopword) lemma with the 
definition part of the gloss to be expanded. In this case, only 264 sample 
sentences were added, a negligible number with respect to the total number of 
gloss samples in WordNet (8195 samples over 79688 glosses). Even if a more 
precise study of the threshold parameter could be done, we suspect that it would 
be really difficult to select gloss samples from different lexical resources, 
since the way definitions are given can differ from one resource to another. 
Therefore, the heuristic we used, inspired by the one used for the mapping of 
different versions of WordNet, cannot be applied between different lexical 
resources. 

5 Conclusions and Further Work 

The experiments with WordNet Domains show that using this resource allows 
recall to be improved without losing too much precision, although the conditions 
under which this can be done are very few. This is mostly due to the small number 
of correspondences that can be found for domains other than “Factotum”. We 
observed that better precision can be obtained for the gloss approach if we 
consider only the sample part of WordNet glosses. Therefore, we tried to add 
further gloss samples from the Cambridge Advanced Learner’s online Dictionary. 
However, due to the poor results obtained, we decided not to integrate the CALD 
glosses into our approach until we are able to add gloss samples in an 
appropriate way. It may be worthwhile to investigate the possibility of selecting 
the gloss samples by disambiguating glosses and finding a matching between 
concepts. The use of other resources such as Roget’s Thesaurus or the Oxford 
Advanced Learner’s Dictionary will also be investigated. At the moment, we are 
also investigating the possibility of using the web as a knowledge source for 
WSD [3], by using an approach inspired by [6]. 
Abstract. In this paper, we present a robust incremental architecture 
for natural language processing centered around syntactic analysis but 
allowing at the same time the description of specialized modules, like 
named entity recognition. We show that the flexibility of our approach 
allows us to intertwine general and specific processing, which has a 
mutual improvement effect on their respective results: for example, 
syntactic analysis clearly benefits from named entity recognition as a 
pre-processing step, but named entity recognition can also take advantage 
of deep syntactic information. 



1 Introduction 

The robust system presented in this article performs deep syntactic analysis as- 
sociated with the detection and categorization of named entities that are present 
in texts. This system is robust as it takes any kind of text as input and always 
gives an output in a short time (about 2000 words/second). At the same time, 
we show that robustness is not synonymous with shallowness, as our system is 
able to handle fine-grained syntactic phenomena (like control and raising). Fur- 
thermore, our system is flexible enough to enable the integration of specialized 
modules, as we did for named entity recognition. We first describe our system, 
then focus on the entity recognition module we developed and show how it is 
integrated in the general processing chain. We then give some examples of the 
benefit of having these two modules developed together: syntactic analysis ben- 
efits from named entity recognition and the task of named entity recognition 
benefits from a fine-grained syntactic analysis. Finally we conclude by giving 
some hints of our future work. 



2 Description of Our System 

2.1 Robust and Deep Syntactic Analysis Using XIP 

XIP (Xerox Incremental Parser) (see Ait et al. [2]) is the tool we use to perform 
robust and deep syntactic analysis. For us, deep syntactic analysis consists in the 
construction of a set of syntactic relations from an input text. Although dependency 
grammars (see Mel’čuk [12] and Tesnière [16]) inspired us, we prefer calling 
the syntactic output of our system syntactic relations, as we do not obey 
principles like projectivity and we take liberties with the above-mentioned 
syntactic paradigms. These relations¹ link lexical units of the input text and/or 
more complex syntactic domains that are constructed during the processing (mainly 
chunks, see Abney [1]). These relations are labelled, when possible, with deep 
syntactic functions. More precisely, we try to link a predicate (verbal or 
nominal) with what we call its deep subject, its deep object, and modifiers. When 
the deep subjects and deep objects are not found, the general syntactic relations 
are still available. For instance, for the sentence [...], the 
parser produces the following relations: 

DETD(law,The)                     MOD_POST_INFINIT(impossible,locate) 
MOD_PRE(law,escheat)              MOD_PRE(impossible,almost) 
NUCL_VLINK_MODAL(cannot,be)       EMBED_INFINIT(locate,is) 
NUCL_VLINK_PASSIVE(be,enforced)   OBJ-N(locate,property) 
OBJ-N(enforced,law)               MOD_PRE(property,such) 
TIME(enforced,now)                SUBJ-N(declared,Daniel) 
EMBED(is,enforced)                MAIN(declared) 
NUCL_SUBJCOMPL(is,impossible)     SUBJ-N(is,it) 

It is important to notice that, in this example, the passive form [...] 
has been recognized as such and then normalized, as we 
obtain the relation OBJ-N(enforced,law). 

We now briefly explain below how these relations are obtained. 

XIP and General Syntactic Analysis. XIP is a tool that integrates different 
steps of NLP, namely: tokenization, POS tagging (combination of HMM and 
hand-made rules), chunking and the extraction of syntactic relations. Chunking 
is not compulsory for the syntactic relation extraction, but we decided to apply 
this first stage of processing in order to find the boundaries of non-recursive 
phrases. This preliminary analysis then facilitates the later processing stage 
(see Giguet’s work [6] for more detailed indications of the benefit of finding 
chunks in order to ease the extraction of dependencies). 

A chunking rule can be expressed in two ways: 

— by sequence rules, which define a list of categories; 

— by ID (immediate dominance) rules, defining sets of categories that are 
combined with LP (linear precedence) constraints. 

In both cases, contextual information can be given. 

¹ We consider binary and more generally n-ary relations. 

For instance, the following sequence rule (in which no specific context is 
given) expresses that a Nominal Chunk (NP) starts with a lexical unit bearing 
the feature det:+ (i.e. is a determiner), can be followed by 0 or more adjectives 
which are followed by a lexical unit having the feature noun:+ 

NP = ?[det:+], (adj)*, ?[noun:+] . 

This rule could also have been expressed by the following ID rule and LP 
constraints². 

NP -> ?[noun:+], ?[det:+], (adj)* . 

[det:+] < [adj:+] 

[adj:+] < [noun:+] 
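
To make the effect of the sequence rule concrete, here is a minimal Python sketch that emulates it with a regular expression over part-of-speech codes. It is not XIP code: the tag names, the one-letter encoding and the function `chunk_np` are assumptions made for illustration, and only the sequence-rule formulation (determiner, zero or more adjectives, noun) is covered.

```python
import re

# One character per category keeps string offsets aligned with token positions.
TAG_CODES = {"det": "D", "adj": "A", "noun": "N", "verb": "V"}

# Emulates: NP = ?[det:+], (adj)*, ?[noun:+]
NP_RULE = re.compile(r"DA*N")

def chunk_np(tagged):
    """Group (word, tag) pairs into NP chunks from left to right; other tokens stay flat."""
    codes = "".join(TAG_CODES.get(tag, "X") for _, tag in tagged)
    result = []
    position = 0
    while position < len(tagged):
        match = NP_RULE.match(codes, position)
        if match:
            words = [word for word, _ in tagged[position:match.end()]]
            result.append(("NP", words))
            position = match.end()
        else:
            result.append(tagged[position])
            position += 1
    return result

tagged = [("The", "det"), ("old", "adj"), ("law", "noun"), ("applies", "verb")]
chunks = chunk_np(tagged)
```

Here `chunks` groups the first three tokens under an NP node and leaves the verb untouched, which is the kind of flat chunk tree the later stages take as input.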

In our approach, after chunking is performed, the system calculates syntactic 
relations through what we call deduction rules. These rules apply to a chunk 
tree (which can be completely flat if no chunks have been previously calculated) 
and consist of three parts: context, condition and extraction. 

Context is a regular expression on chunk tree nodes that has to match 
a syntactic construction. 

Condition is a boolean condition on dependencies, on linear order between 
nodes of the chunk tree, or on a comparison of features associated with nodes. 

Extraction corresponds to a list of dependencies to be created if the contextual 
description matches and the condition is true. 

For instance, the following rule establishes a SUBJ relation between the head 
of a nominal chunk and a finite verb: 

| NP{?*, #1[last:+]}, ?*[verb:~], VP{?*, #2[last:+]} | 

if (~SUBJ(#2,#1)) 

SUBJ(#2,#1). 

The first line of the rule corresponds to the context and describes a nominal 
chunk in which the last element is assigned to the variable #1, followed by 
anything but a verb, followed by a verbal chunk in which the last element is 
assigned to the variable #2. The second line checks whether a SUBJ relation 
exists between the lexical nodes corresponding to the variable #2 (the verb) and 
#1 (the head of the nominal chunk). The test is true if the SUBJ relation does 
not exist. If both context and condition are verified, then a SUBJ relation is 
created between the verb and the noun (last line). 
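
The three parts of a deduction rule (context, condition, extraction) can be mimicked in Python as follows. This is a simplified sketch, not the XIP engine: the chunk tree is assumed to be a flat list of `(label, words)` pairs, a bare `VERB` label stands for a verb outside any chunk, and the function name is hypothetical.

```python
def apply_subj_rule(chunks, relations):
    """Add SUBJ(verb, noun) for each NP followed, with no verb in between, by a VP."""
    for index, (label, words) in enumerate(chunks):
        if label != "NP":
            continue
        noun = words[-1]                      # context: #1[last:+] inside the NP
        for next_label, next_words in chunks[index + 1:]:
            if next_label == "VP":
                verb = next_words[-1]         # context: #2[last:+] inside the VP
                candidate = ("SUBJ", verb, noun)
                if candidate not in relations:    # condition: ~SUBJ(#2,#1)
                    relations.append(candidate)   # extraction: SUBJ(#2,#1)
                break
            if next_label == "VERB":          # context: only non-verb nodes may intervene
                break
    return relations

chunks = [("NP", ["The", "law"]), ("ADV", ["now"]), ("VP", ["can", "apply"])]
relations = apply_subj_rule(chunks, [])
relations_again = apply_subj_rule(chunks, list(relations))
```

Running the rule a second time leaves the relation list unchanged, which is the role the negative condition plays in the original rule.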

An important feature is that our parser always provides a unique analysis (it 
is deterministic), this analysis being potentially underspecified. 

XIP and Deep Syntactic Analysis. Together with surface syntactic relations 
handled by a general English grammar, we calculate more sophisticated 
and complex relations using derivational morphology properties, deep syntactic 
properties (subject and object of infinitives in the context of control verbs), 
and some limited lexical semantic coding (Levin’s verb class alternations). These 
deep syntactic relations correspond roughly to the agent-experiencer roles, which 
are subsumed by the SUBJ-N relation, and to the patient-theme roles, subsumed by 
the OBJ-N relation. Not only verbs bear these relations but also deverbal nouns 
with their corresponding arguments (for more details on deep syntactic analysis 
using XIP see Hagège and Roux [7]). 

² The symbol < expresses linear precedence. 

For instance the following rule establishes that the surface subject of a verb 
in passive form is in fact the deep object of this verb while the surface object of 
this verb corresponds to the deep subject. 

if ( SUBJ(#1[passive:+],#2) & OBJ(#1,#3) ) 

SUBJ-N(#1,#3), 

OBJ-N(#1,#2). 

At the end of the deep syntactic analysis stage, deep syntactic relations to- 
gether with surface syntactic relations are available. 
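
As an illustration of the passive normalization rule above, the following Python sketch swaps surface subject and object for passive verbs. The data layout (relations as `(name, head, dependent)` triples, one subject and one object per verb) and the example words are assumptions for the sake of the sketch.

```python
def normalize_passive(surface_relations, passive_verbs):
    """Derive SUBJ-N / OBJ-N relations for passive verbs from surface SUBJ / OBJ."""
    subjects = {head: dep for name, head, dep in surface_relations if name == "SUBJ"}
    objects = {head: dep for name, head, dep in surface_relations if name == "OBJ"}
    deep = []
    for verb in sorted(passive_verbs):
        # Both surface relations must be present, as in the rule's condition.
        if verb in subjects and verb in objects:
            deep.append(("SUBJ-N", verb, objects[verb]))   # surface object -> deep subject
            deep.append(("OBJ-N", verb, subjects[verb]))   # surface subject -> deep object
    return deep

surface = [("SUBJ", "enforced", "law"), ("OBJ", "enforced", "state")]
deep = normalize_passive(surface, {"enforced"})
```

The surface relations are left untouched, so both layers remain available afterwards, as stated above.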

2.2 A Basic System for Named Entity Categorization 

Named entity recognition and categorization is a fundamental task for a wide 
variety of natural language processing applications, such as question answering, 
information management, text mining and business intelligence, lexical 
acquisition, etc. The NLP community therefore shows great interest in this 
issue. For example, the MUC conferences defined a named entity recognition task 
using annotated corpora, and enabled the comparison of different methods 
for the task (see Sekine and Eriguchi [14] for an interesting state of the art of 
the different methodologies, and Poibeau [13] for an analysis of the evaluation 
criteria). More recently, the ACE³ project (Automatic Content Extraction) has 
included a specific task concerning named entities as well. 

This task is also a useful step towards achieving fine-grained syntactic and 
semantic analysis. For these reasons, it seemed useful to integrate such 
functionality into the XIP parser. Moreover, the overall parsing process should 
benefit from the integration of this module. 

The system we built for named entity categorization focuses on the following 
predefined classes: 

— percentages, e.g. [...] %, 

— dates, e.g. [...], and temporal expressions, e.g. [...], 

— expressions denoting an amount of money, e.g. $ [...], 

— locations, e.g. [...], 

— person names, e.g. [...], 

— organizations, e.g. [...], 

— events, e.g. [...], 

— legal documents [...]. 

³ http://www.ldc.upenn.edu/Projects/ACE/intro.html 

This list is non-exhaustive, but corresponds to the most common types of 
entities generally recognized by dedicated systems. This “basic” system is built 
within the XIP parser presented above, on top of a part-of-speech tagger. This 
system is purely rule-based. It consists of a set of ordered local rules that use 
lexical information combined with contextual information about part-of-speech, 
lemma forms and lexical features. These rules detect the sequence of words 
involved in the entity and assign a feature (loc, org, etc.) to the top node of the 
sequence, which is a noun in most cases. In the incremental parsing process, 
these rules are applied in the pre-syntactic component, before the chunking and 
dependency rules; therefore no syntax is used at this stage. For example, the 
following rule is used to detect and categorize organization names consisting of 
a sequence of nouns starting with a capital letter (feature cap) and finishing with 
a typical organisation marker like [...], etc. These markers bear the 
feature orgEnd within the lexicon. 

noun[org=+] -> noun+[cap=+], noun[orgEnd=+] 

The rule enables the detection of the organization name in the following 
example: [...] 

At this stage of the processing, one can already use contextual information 
in these rules. This is illustrated by the following rule: 

noun[person=+] = noun[cap=+] |part, noun[famtie=+]| 

This rule transforms (overriding sign =) a noun starting with a capital letter 
into a person name when it is followed by an element of category part (’s) and 
by a noun bearing the feature famtie, like [...], etc. It enables the 
detection of the person name in the following example: [...] 

These rules are combined with a propagation system integrated into the 
parser, which allows subparts of the entities to be marked, and then allows new 
occurrences of these subparts to be categorized when encountered later in the 
text. For example, if the word [...], which is ambiguous between a person 
name and a city name, is encountered in some part of the text in the string 
[...], then the system tags all remaining occurrences of [...] 
as a person, using feature propagation on this entity part. This functionality 
is also very useful when proper nouns are truncated, which is very common for 
(business) organisation names: [...] 

At the current stage of development, the basic system contains about 300 
local grammar rules for entity detection. 
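
The combination of a local rule and feature propagation can be sketched in Python as below. This is a toy approximation rather than the actual XIP grammar: the marker list `ORG_END`, the company name and both function names are invented for illustration, and capitalization is tested naively on the first character.

```python
ORG_END = {"Inc", "Ltd", "Corp"}  # hypothetical lexicon entries bearing the orgEnd feature

def tag_organizations(tokens):
    """Return (start, end) spans: one or more capitalized tokens followed by an orgEnd marker."""
    spans = []
    i = 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j][:1].isupper() and tokens[j] not in ORG_END:
            j += 1
        if j > i and j < len(tokens) and tokens[j] in ORG_END:
            spans.append((i, j + 1))
            i = j + 1
        else:
            i = max(j, i + 1)
    return spans

def propagate(tokens, spans, label="org"):
    """Tag later occurrences of entity subparts that appear outside the detected spans."""
    first_seen = {}
    covered = set()
    for start, end in spans:
        for k in range(start, end):
            covered.add(k)
            if tokens[k] not in ORG_END:
                first_seen.setdefault(tokens[k], k)
    return [(k, label) for k, token in enumerate(tokens)
            if token in first_seen and k > first_seen[token] and k not in covered]

tokens = ["Acme", "Widgets", "Inc", "said", "that", "Acme", "will", "expand"]
spans = tag_organizations(tokens)
propagated = propagate(tokens, spans)
```

The second, truncated mention of the invented company is only recognized through propagation, since no organisation marker follows it.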

Since the entity recognition system is embedded in a syntactic parser, the 
corresponding rules have been built to maintain high precision, in an 
attempt to prevent any deterioration of the parsing results. We have conducted a 
preliminary evaluation on a short corpus (about 500 sentences) from the Reuters 
news agency, on the location, person, and organisation named entities. It led to 
90.2% precision and 75.5% recall. 

2.3 Complete Architecture 

The complete architecture of the parsing system, including entity processing, is 
shown in Figure 1. Parsing and named entity categorization can interact with one 
another (bold arrows). 

[Figure: INPUT TEXT at the top; ENTITY DEPENDENCIES and SYNTACTIC DEPENDENCIES as outputs] 

Fig. 1. Architecture of the System 



3 Parsing and Entity Categorization as Interleaved 
Processes 

3.1 Syntax and Entities 

As presented before, our entity recognition system is embedded in our syntactic 
parser. Apart from being a useful task by itself, entity recognition improves the 
overall quality of the parsing output. 

Entities for Better Tokenization. The first straightforward improvement 
concerns tokenization problems. Treating numbers, dates, monetary expressions, 
etc., as syntactic units avoids many word segmentation problems, since these 
expressions are often formed with punctuation signs that could be misinterpreted 
at a deeper syntactic level. Consider the following sentence: [...] 

Analysing [...] as a unit (organization) allows the nominal apposition 
of [...] to be delimited properly: the interpretation of a comma 
needs to be different when it marks the boundaries of a nominal apposition and 
when it is employed within an organisation name. This is what is reflected by 
our incremental multi-layer analysis. 

Entities and POS Tagging. Still in pre-syntactic processing, at the level of 
POS disambiguation, the prior detection of entities can help avoid POS 
tagging errors that would then have consequences for the rest of the processing. 
Take for instance the following example: [...]. Having detected [...] as an 
organization prevents it from being interpreted as a verb, which would have 
important consequences for the syntactic processing. 



Entities and Syntactic Analysis. More interestingly, another kind of im- 
provement concerns syntactic analysis directly. In fact, entity categorization gives 
some semantic information that can benefit the syntactic analysis. We can take 
a concrete case that represents one of the difficult points in the syntactic pro- 
cessing, namely the treatment of coordination. 

Consider for instance the two following utterances, where geographical entities 
are marked up with <LOC> and organizations with <ORG>: [...] and [...] 

In the first example, the fact that [...] and [...] are both geographical 
entities enables us to consider that they are coordinated together. On the 
contrary, in the second example, as [...] is not of the same type as [...], we 
prefer to coordinate it with the name [...]. 



Entities: A Step Towards Semantics. Finally, when an entity name is dependent 
on a governor, knowing the semantic type of the entity can help determine 
the kind of relationship that exists between the entity and its governor. It 
can help the labelling of semantic-oriented relations. Take the following 
example: [...] 

- Knowing that [...] is marked up as a geographical entity, 

- having the syntactic relation [...] between [...] and [...], where [...] 
is the governor and [...] a dependent, 

- and knowing that [...] is introduced by the preposition [...] enables the 
MODIFIER syntactic relation to be further specified as a LOCALIZATION 
relation. 
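
The three conditions listed above can be combined as in the following Python sketch. The set of locative prepositions, the data structures and the example words are assumptions made for illustration; the source does not specify which prepositions trigger the refinement.

```python
def specify_modifier(relations, entity_types, prepositions,
                     locative_prepositions=frozenset({"in", "at"})):
    """Relabel MODIFIER as LOCALIZATION when the dependent is a location
    introduced by a locative preposition (the preposition set is hypothetical)."""
    refined = []
    for name, governor, dependent in relations:
        if (name == "MODIFIER"
                and entity_types.get(dependent) == "LOCATION"
                and prepositions.get(dependent) in locative_prepositions):
            refined.append(("LOCALIZATION", governor, dependent))
        else:
            refined.append((name, governor, dependent))
    return refined

relations = [("MODIFIER", "works", "Grenoble"), ("MODIFIER", "works", "often")]
refined = specify_modifier(relations,
                           entity_types={"Grenoble": "LOCATION"},
                           prepositions={"Grenoble": "in"})
```

Only the first relation is relabelled; a modifier that is not a typed entity keeps its general label.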
3.2 Entities and Deep Syntax 

Entity Metonymy. Recently, the ACE project has focused on [...], [...] and 
[...]. In the context of this project, a particular emphasis is placed 
on named entity metonymy, namely the fact that a given entity may have 
different “senses”, and therefore should be categorized differently according to 
the context. For instance, the word [...] can be marked as a geographical entity 
(LOC), but when used in the following context: [...], it is obvious that [...] 
should not be considered as a geographical unit but as a human organization. In 
a similar way, the word Rolex in [...] 
should not be typed as an organization (Swiss watch designer company) but as 
a common noun (watches made by the Rolex company). This phenomenon is 
distinct from “basic ambiguity”, as for [...], which is either a location or 
a person. Therefore ACE focuses on semantic analysis whereas previous projects 
in the same line, like MUC, focused on linguistic analysis. 

In this context, Maynard et al. [11] show how they adapt their “standard” 
entity detection system to the ACE requirements, in a very efficient way, taking 
advantage of the modularity of the GATE architecture. However, the adaptation 
of the system to take the phenomenon of metonymy into account is not described. 
Furthermore, in the work described by Maynard et al., parsing is not one of the 
stages in the processing chain. 

Along these lines, we decided to do an experiment on the contribution of 
deep syntactic parsing to named entity metonymy detection, making use of the 
flexibility of our architecture. 

The system of named entity extraction we presented above can be enriched 
and improved using the results of robust deep syntactic processing with some 
limited lexical semantic coding. As claimed in McDonald [10], richer contextual 
information is necessary for high accuracy, and in our case this richer information 
consists of deep parsing results combined with lexical knowledge. 

The enrichment, provided by an independent module, can be applied to any 
kind of named entity extraction system (rule-based in our case, but it could also 
be used on top of a learning system or a mixed approach). This enables better 
semantic categorization of entities and also the discovery of named entities that 
are not easily detectable within a restricted context. Moreover, this information 
allows entity categorization to be overridden or more precisely specified 
when some strings that could denote entity names are used as common nouns 
(e.g. [...], where [...] is here an artefact, i.e. a car 
of the Alfa Romeo brand). To a certain extent, the task we want to perform is 
similar to Word Sense Disambiguation (see for example Ide and Véronis [8]), in the 
context of named entity categorization: we use prototypical information about 
subcategorization frames to disambiguate named entities. 

In the following subsections we describe our module for the refinement of 
semantic categorization of named entities, which is based on deep syntactic 
processing. 

This deep syntactic processing produces a normalized syntactic analysis (taking 
advantage of morphological properties of words and of minimal lexical 
semantic information). At the end of the analysis process, the labeling of 
previously detected entities is refined. Some categories attached to entities that 
have been detected may be overridden by others or simply discarded. Moreover, it 
is important to observe that our methodology improves entity detection for other 
kinds of problems: 

— It helps characterize highly ambiguous entities, since they are syntactically 
related to words which bear some semantic features (e.g. “<PERS> Turner 
</PERS> says...” vs. “<LOC> Turner </LOC> is located in the hills of 
Western Maine”). 

— It enables entities to be typed even when gazetteer information is missing 
(e.g. “<ORG> DELFMEMS </ORG> will be a new company created in 
2004”). 

It is important to notice that we do not follow the ACE guidelines (ACE [5]) 
exactly, in particular for the Geopolitical entities (GPE, i.e. “composite entities 
comprised of a population, a government, a physical location and a nation”), for 
which we have a less fine-grained tagset. Indeed, we do not keep the distinction 
between: 

— [...]: GPE with role ORG; 

— [...]: GPE with role PERS; 

— [...]: GPE with role LOC; 

— [...]: GPE with role GPE. 

Since, for us, an organisation is basically a group of people, we simply aim to 
distinguish the location sense of GPE from the other senses, which we consider 
as organization. 

Moreover, our system focuses on proper noun categorization: it won’t attempt 
to spot common nouns like [...], as in ACE. 

How Deep Syntax Can Help. The contextual rules of the entity recognition 
module described above enable us to catch and categorize named entities with 
reasonable precision. In this section we show how deep syntax can help and 
improve the named entity recognition task in the following ways: 

— refine a rough but correct categorization done in a previous step, 

— detect some entities that have not been detected previously, 

— override an incorrect entity categorization that has been previously made. 

As said before, our system can extract deep syntactic relations between 
predicates and their arguments. Having some information about the selectional 
restrictions of predicates appearing in the text thus enables us to check and 
possibly correct the tagging of named entities which are arguments of these 
predicates. In this paper, what especially interests us is the SUBJ-N relation 
that links a predicate to its normalized subject. If a predicate denotes an 
activity that is typically performed by human beings, then we can be sure that 
the normalized subject of this predicate has to bear a human feature. With this 
in mind, when entities are found to be involved in a SUBJ-N dependency with a 
predicate that expresses a human activity, then we know that this entity cannot 
be a product or a location but something that denotes one human being or a group 
of human beings. This enables the refinement of named entity tags found in a 
previous step or possibly the detection of some entities that have not been 
previously detected because of a lack of context. For instance, in the example 
[...], as [...] is typically a verb whose deep subject is a human or a group of 
human beings, we can easily infer that the [...] interpretation for the entity 
[...] is not correct and that [...] is here an organisation. 

Furthermore, simple syntactic contexts can be clues to override erroneous 
named entity categorization. For instance, in the example [...], as we have a 
possessive determining the word [...], it is very unlikely that [...] has the 
status of a company name, and hence it is not a named entity even if it starts 
with an upper-case letter and is present in some gazetteer. 

The limited lexical coding that we performed in order to refine our named 
entity categorization module consists mostly of the categorization of a set of 
verbs expecting a human (or at least a living) being or a group of human beings 
as normalized subject. 

As much as possible, we use pre-defined semantic verb classes, such as Levin’s 
classes (see Levin [9]). Interesting classes that we found in Levin are the 
following: 

— “Learn verbs”, for instance [...], etc. 

— “Social interaction verbs”, for instance [...], etc. 

— “Communication verbs”, for instance [...], etc. 

— “Ingestion verbs”, for instance [...], etc. 

— “Killing verbs”, for instance [...], etc. 

Together with these significant Levin classes, we also use verbs that introduce 
indirect speech, which were already marked in our system, as they possess specific 
syntactic properties (inversion of the subject for instance). This class of verbs 
was extracted from the COMLEX lexicon [4] and consists of verbs like 
[...], etc. 

We give below an example of a XIP rule showing that if we find that the 
deep subject of a verb of human activity is tagged as a named entity denoting a 
location, then we retract this interpretation and type this entity as a person 
or human organisation (PERS_OR_ORG unary relation). 

if ( SUBJ-N(#1[human_activ],#2) & ^LOCATION(#2) ) 

PERS_OR_ORG(#2) 
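
The effect of this override rule can be sketched in Python as follows. The verb list stands in for the lexically coded classes and is purely hypothetical, as are the entity names; relations are again assumed to be `(name, predicate, argument)` triples.

```python
HUMAN_ACTIVITY = {"say", "announce", "decide"}  # hypothetical verbs coded as human activities

def refine_entities(entity_types, deep_relations):
    """Retype a LOCATION as PERS_OR_ORG when it is the deep subject of a human-activity verb."""
    refined = dict(entity_types)
    for name, predicate, argument in deep_relations:
        if (name == "SUBJ-N"
                and predicate in HUMAN_ACTIVITY
                and refined.get(argument) == "LOCATION"):
            refined[argument] = "PERS_OR_ORG"
    return refined

entity_types = {"Paris": "LOCATION", "Lyon": "LOCATION"}
deep_relations = [("SUBJ-N", "announce", "Paris"), ("MOD", "live", "Lyon")]
refined_types = refine_entities(entity_types, deep_relations)
```

Only the entity that is the normalized subject of a human-activity verb is retyped; the other location keeps its original category.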

The next section shows that this limited and straightforward classification, 
even if modest, refines some of the entity categorization performed by the general 
system. 

3.3 About Evaluation 

In this section we describe a small experiment we performed in order to evaluate 
the impact of deep syntactic analysis on the named entity detection and 
categorization task. 

In this experiment, the challenge is different from what we expect from the 
general named entity detection system. Here we want to be able to distinguish 
when an entity like [...] denotes a place and when it denotes an organization 
(see ACE and GPE distinctions above). Once again, we use a small corpus of 
about 500 sentences and annotate it manually, taking metonymy into account. 

The results we obtained with the general entity detection system on the refined 
category set are the following: 

— Precision: 81% 

— Recall: 75% 

Using deep syntax and the limited lexical information we mentioned above, 
we obtained the following results: 

— Precision: 85% 

— Recall: 74% 

These results show that with a minimal effort in lexical coding and with 
the use of only two kinds of grammatical relations (namely deep-subject and 
possessive), precision increases by four points while recall decreases by one 
point. We expect that a deeper study of the impact of syntactic properties on 
entity categorization will enable us to go further in that direction. 



4 Conclusions and Future Work 

In this paper, we present a robust architecture for deep syntactic analysis, en- 
riched by a named entity detection module. Named entity detection is often seen 
as an independent NLP task, but we show that a syntactic parser can benefit 
from it as a pre-processing step. Moreover, since recent trends in named entity 
categorization focus on semantic analysis (e.g. metonymy), we think that deep 
syntactic information is necessary to bridge the gap between a linguistic analysis 
and a deeper semantic analysis of named entities. We thus propose a system that 
interleaves parsing and entity detection. First results are encouraging, and we 
plan to pursue our work with a deeper linguistic study of the syntactic informa- 
tion needed to improve our system. 

In addition, we think that Word Sense Disambiguation is a task that 
should make the most of entity categorization. We previously developed a 
rule-based Word Sense Disambiguation system, one of whose main components 
is our syntactic parser (Brun and Segond [3]). Since the integration of named 
entity categorization results is handled directly by our architecture, it could be 
worthwhile to evaluate the consequences on the Word Sense Disambiguation 
task. 
Abstract. In this work, we present an approach to language understanding 
using corpus-based and statistical language models based on 
multigrams. Assuming that we can assign meanings to segments of words, 
the n-multigram model is a good approach to modelling sequences 
of segments that have semantic information associated with them. This 
approach has been applied to the task of speech understanding in the 
framework of a dialogue system that answers queries about train timetables 
in Spanish. Some experimental results are also reported. 



1 Introduction 

Nowadays, the use of automatic learning techniques for Language Modelling is 
quite extensive in the field of Human Language Technologies. A good example 
of this can be found in the development of Spoken Dialogue systems. Very im- 
portant components of these systems, such as the Language Model of the speech 
recognition component, the Language Model of the understanding component 
and the dialogue structure, can be modelled by corpus-based and statistical 
finite-state models. This is the case of the widely used n-gram models [1][2]. 

An n-gram model assigns a probability to a word depending on the previous 
n-1 words observed in the recent history of that word in the sentence. In these 
models, the word is the selected linguistic unit for the language modelization, 
and the recent history considered for the model always has the same length. Over 
the last few years, there has been an increasing interest in statistical language 
models which try to take into account the dependencies among a variable number 
of words; that is, stochastic models in which the probability associated to a word 
depends on the occurrence of a variable number of words in its recent history. 
This is the case of the grammar-based approaches [3] [4] [5], in which models 
take into account variable-length dependencies by conditioning the probability 
of each word with a context of variable length. In contrast, in segment-based 
approaches such as multigrams [6] [7] [8], sentences are structured into variable- 
length segments, and probabilities are assigned to segments instead of words. 
In other words, multigram approaches to language modelling take segments of 
words as basic units; these models try to naturally model the fact that there are 
certain concatenations of words that occur very frequently. These segments could 
constitute one of the following: relevant linguistic units; syntactic or semantic 
units; or a concatenation of words which works well from the language modelling 
point of view (independently of the linguistic relevance of the segments). 

Language Understanding systems have many applications in several areas 
of Natural Language Processing. Typical applications are train or plane travel 
information retrieval, car navigation systems or information desks. In the last 
few years, many efforts have been made in the development of natural language 
dialog systems which allow us to extract information from databases. The inter- 
action with the machine to obtain this kind of information requires some dialog 
turns. In these turns, the user and the system interchange information in order 
to achieve the objective: the answer to a query made by the user. Each turn (a 
sequence of natural language sentences) of the user must be understood by the 
system. Therefore, an acceptable behavior of the understanding component of 
the system is essential to the correct performance of the whole dialog system. 

Language understanding can be seen as a transduction process from sentences 
to a representation of their meaning. Frequently, this semantic representation 
consists of sequences of concepts (semantic units). Multigram models have been 
applied to language modeling tasks [6] [8]; however, as the multigram approach 
is based on considering a sentence as a sequence of variable-length segments of 
words, it could be interesting to apply this methodology to language understand- 
ing. 

In this work, we propose the application of multigram models to language 
understanding. In this proposal, we associate semantic units to segments, and 
we modelize the concatenation of semantic units as well as the association of 
segments of words to each semantic unit. This approach has been applied to 
the understanding process in the BASURDE dialog system [9] . The BASURDE 
system answers telephone queries about railway timetables in Spanish. 

This paper is organized as follows: in Section 2, the concept of n-multigram is 
described. In Section 3, the task of language understanding is presented and in 
Section 4, the application of n-multigrams to language understanding is proposed. 
In Section 5, the two grammatical inference techniques used in the experiments 
are illustrated. Finally, results of the application of the n-multigram models to 
a task of language understanding and some concluding remarks are presented. 

2 The n-multigram Model 

Let $\Sigma$ be a vocabulary of words, and let $w = w_1 w_2 w_3 w_4$ be a sentence 
defined on this vocabulary, where $w_i \in \Sigma$, $i = 1, \ldots, 4$. From a 
multigram framework point of view, the sentence $w$ has the set of all possible 
segmentations of the sentence associated to it. Let $S$ be this set of 
segmentations for $w$. If we use the symbol # to express the concatenation of 
words which constitute the same segment, $S$ is as follows: 

$$
S = \{\; w_1\ w_2\ w_3\ w_4,\quad w_1\ w_2\ w_3\#w_4,\quad
w_1\ w_2\#w_3\ w_4,\quad w_1\#w_2\ w_3\ w_4,
$$
$$
\quad\ w_1\ w_2\#w_3\#w_4,\quad w_1\#w_2\ w_3\#w_4,\quad
w_1\#w_2\#w_3\ w_4,\quad w_1\#w_2\#w_3\#w_4 \;\}
$$
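
As an illustration, the set of segmentations above can be enumerated mechanically: 
each of the gaps between consecutive words is either a segment boundary or not, so 
a sentence of four words has $2^3 = 8$ segmentations. The following is a minimal 
sketch; the function name `all_segmentations` is ours and is not part of the 
original system. 

```python
from itertools import product


def all_segmentations(words):
    """Return every segmentation of a non-empty list of words.

    Words that belong to the same segment are joined by '#',
    segments are separated by a blank, as in the notation of the text.
    """
    n = len(words)
    result = []
    # Each of the n-1 gaps between words is either a boundary (True) or not.
    for boundaries in product([True, False], repeat=n - 1):
        segments, current = [], [words[0]]
        for word, is_boundary in zip(words[1:], boundaries):
            if is_boundary:
                segments.append("#".join(current))
                current = [word]
            else:
                current.append(word)
        segments.append("#".join(current))
        result.append(" ".join(segments))
    return result


S = all_segmentations(["w1", "w2", "w3", "w4"])
print(len(S))  # 8
for s in S:
    print(s)
```
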

From a multigram language model point of view, the likelihood of a sentence 
is computed by summing up the likelihood values of all possible segmentations 
of the sentence into segments [6] [8]. Let $w$ be a sentence, and let $S$ be the 
set of all segmentations of $w$. The likelihood of the sentence, $\mathcal{L}(w)$, 
given a multigram language model is: 

$$\mathcal{L}(w) = \sum_{s \in S} \mathcal{L}(w, s) \tag{1}$$

where $\mathcal{L}(w, s)$ is the likelihood of the sentence $w$ given the 
segmentation $s$. 

The likelihood of any particular segmentation depends on the model assumed 
to describe the dependencies between the segments. The most usual approach, 
called n-multigram [8], assumes that the likelihood of a segment depends on the 
n-1 segments that precede it. This approach can be seen as an extension of 
n-gram models of words to n-gram models of segments. Therefore, the likelihood 
of a segmentation is: 

$$\mathcal{L}(w, s) = \prod_{r} p\big(s_{(r)} \mid s_{(r-n+1)} \cdots s_{(r-1)}\big) \tag{2}$$

where $s_{(r)}$ represents the $r$-th segment in the segmentation $s$. 

Due to the high number of parameters to estimate, it is convenient to define 
classes of segments. The formula presented above becomes: 

$$\mathcal{L}(w, s) = \prod_{r} p\big(C_{q(s_{(r)})} \mid C_{q(s_{(r-n+1)})} \cdots C_{q(s_{(r-1)})}\big) \cdot p\big(s_{(r)} \mid C_{q(s_{(r)})}\big) \tag{3}$$

where $q$ is a function that assigns a class to each segment (it is generally 
assumed that a segment is associated to only one class); $C_{q(s_{(r)})}$ is the 
class assigned to the segment $s_{(r)}$, and $p(s_{(r)} \mid C_{q(s_{(r)})})$ is the 
probability of the segment $s_{(r)}$ in its class. 

Thus, an n-multigram model based on classes of segments is completely defined 
by: the probability distribution of sequences of classes, and the classification 
function $q$. 

As we mentioned above, to obtain the likelihood of a sentence $\mathcal{L}(w)$, 
all the possible segmentations have to be taken into account. It can be computed 
as follows: 

$$\mathcal{L}(w) = \alpha(|w|) \tag{4}$$

where $\alpha(t)$ is the likelihood of $w_1 \cdots w_t$, that is, the prefix of $w$ 
of length $t$. Considering that the number of words in the segments is limited by 
$l$, we can define $\alpha(t)$ as: 

$$\alpha(t) = \sum_{i=1}^{l} \alpha_i(t) \tag{5}$$

where $\alpha_i(t)$ is the likelihood of the prefix $w_1 \cdots w_t$ which only 
takes into account those segmentations whose last segment is composed of $i$ 
words. Writing $w_a^b$ for the segment $w_a \# \cdots \# w_b$, this value can be 
calculated as: 

$$
\alpha_i(t) =
\begin{cases}
1 & : t = 0 \wedge i = 1 \\
\sum_{r=1}^{l} \alpha_r(t-i) \cdot p\big(w_{t-i+1}^{t} \mid w_{t-i-r+1}^{t-i}\big) & : 1 \le t \le |w| \wedge 1 \le i \le l
\end{cases}
\tag{6}
$$
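
The recursion in (4)-(6) can be checked against the brute-force sum in (1). The 
sketch below does this for the bi-multigram case (each segment conditioned on the 
previous one). It is only an illustration under our own assumptions: `toy_prob` 
is an arbitrary, unnormalized stand-in for $p(\cdot \mid \cdot)$, the empty tuple 
plays the role of the context of the first segment, and none of the function 
names come from the original system. 

```python
def segmentations(words, max_len):
    """Yield every split of `words` into segments of at most `max_len` words."""
    if not words:
        yield []
        return
    for i in range(1, min(max_len, len(words)) + 1):
        for rest in segmentations(words[i:], max_len):
            yield [tuple(words[:i])] + rest


def toy_prob(segment, previous):
    """Arbitrary positive score standing in for p(segment | previous)."""
    return 1.0 / (1 + len(" ".join(segment)) + 2 * len(" ".join(previous)))


def brute_force_likelihood(words, max_len, prob):
    """Equation (1): sum the likelihood of every segmentation."""
    total = 0.0
    for s in segmentations(words, max_len):
        likelihood, previous = 1.0, ()
        for segment in s:
            likelihood *= prob(segment, previous)
            previous = segment
        total += likelihood
    return total


def forward_likelihood(words, max_len, prob):
    """Equations (4)-(6): alpha[t][i] covers the first t words,
    restricted to segmentations whose last segment has i words."""
    n = len(words)
    alpha = [[0.0] * (max_len + 1) for _ in range(n + 1)]
    alpha[0][1] = 1.0  # base case: t = 0 and i = 1
    for t in range(1, n + 1):
        for i in range(1, min(max_len, t) + 1):
            segment = tuple(words[t - i:t])
            if t == i:
                # First segment of the sentence: empty context.
                alpha[t][i] = alpha[0][1] * prob(segment, ())
            else:
                alpha[t][i] = sum(
                    alpha[t - i][r] * prob(segment, tuple(words[t - i - r:t - i]))
                    for r in range(1, min(max_len, t - i) + 1)
                )
    return sum(alpha[n][1:])


sentence = ["me", "podria", "decir", "los"]
print(brute_force_likelihood(sentence, 4, toy_prob))
print(forward_likelihood(sentence, 4, toy_prob))
```

Both calls print the same value (up to floating-point rounding), while the 
forward computation avoids enumerating the exponentially many segmentations. 
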

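In the case of considering classes of segments (and assuming that each segment 
has only been associated to a class), (6) can be calculated by: 

$$
\alpha_i(t) =
\begin{cases}
1 & : t = 0 \wedge i = 1 \\
\sum_{r=1}^{l} \alpha_r(t-i) \cdot p\big(C_{q(w_{t-i+1}^{t})} \mid C_{q(w_{t-i-r+1}^{t-i})}\big) \cdot p\big(w_{t-i+1}^{t} \mid C_{q(w_{t-i+1}^{t})}\big) & : 1 \le t \le |w| \wedge 1 \le i \le l
\end{cases}
\tag{7}
$$

where $C_{q(w_a^b)}$ is the class associated to the segment 
$w_a \# \cdots \# w_b$ by the function $q$, and $p(w_a^b \mid C_{q(w_a^b)})$ is the 
probability that this segment belongs to the category $C_{q(w_a^b)}$. 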



3 Language Understanding 

A language understanding system can be viewed as a transducer in which nat- 
ural language sentences are the input and their corresponding semantic repre- 
sentation (frames) are the output. In [10], an approach for the development of 
language understanding systems that is based on automatic learning techniques 
has been presented. In this approach, the process of translation is divided into 
two phases: the first phase transduces the input sentence into a semantic sen- 
tence (a sequence of semantic units) which is defined in a sequential Intermediate 
Semantic Language (ISL). The second phase transduces the semantic sentence 
into its corresponding frames. Automatic learning techniques are applied in the 
first phase, and the second phase is performed by a simple rule-based system. 

As the ISL sentences are sequential with the input language, we can perform 
a segmentation of the input sentence into a number of intervals which is equal 
to the number of semantic units in the corresponding semantic sentence. Let 
$W$ be the vocabulary of the task (set of words), and let $V$ be the alphabet of 
semantic units; each sentence $w \in W^*$ has a pair $(u, v)$ associated to it, 
where $v$ is a sequence of semantic units and $u$ is a sequence of segments of 
words. That is, $v = v_1 v_2 \cdots v_n$, $v_i \in V$, $i = 1, \ldots, n$; 
$u = u_1 u_2 \cdots u_n$, $u_i = w_{i1} w_{i2} \cdots w_{i|u_i|}$, $w_{ij} \in W$, 
$i = 1, \ldots, n$, $j = 1, \ldots, |u_i|$. 

For example, in the BASURDE corpus, the sentence ”me podria decir los 
horarios de trenes para Barcelona” (can you tell me the railway timetables to 
Barcelona), whose translation in ISL could be consulta <hora_salida> 
marcador_destino ciudad_destino (query <depart_time> destination_marker 
destination_city), has the following pair associated to it, which is the output 
of the first phase of our understanding approach. 



$(u, v) = (u_1 u_2 u_3 u_4,\ v_1 v_2 v_3 v_4)$ where: 

Spanish                                              English
u1: me podria decir          v1: consulta            u1: can you tell me          v1: query
u2: los horarios de trenes   v2: <hora_salida>       u2: the railway timetables   v2: <departure_time>
u3: para                     v3: marcador_destino    u3: to                       v3: destination_marker
u4: Barcelona                v4: ciudad_destino      u4: Barcelona                v4: destination_city



The output for the second phase is the following frame: 

(DEPART_TIME) 

DESTINATION_CITY: Barcelona 



4 Applying n-multigram Models to Language 
Understanding 

The first phase of our language understanding approach, that is, the semantic 
segmentation of a sentence $w$, consists of dividing the sentence into a sequence 
of segments ($u = u_1 u_2 \cdots u_n$) and associating a semantic unit $v_i \in V$ 
to each segment $u_i$. Because the multigram approach considers sentences 
as sequences of segments of variable length, it seems especially appropriate for 
the semantic segmentation problem. In order to apply n-multigrams to language 
understanding, the following considerations must be taken into account: 

- The training corpus consists of a set of segmented and semantically labelled 
sentences. 

- The classes of segments are the semantic vocabulary $V$, which is obtained 
from the definition of the ISL for the understanding task. 

- A segment could be assigned to several classes, so the classification function 
$q$ must be adapted in order to provide a membership probability for each 
segment in each class. 

4.1 Learning the Semantic Model 

In an n-multigram model, there are two kinds of probability distributions that 
have to be estimated: a) the language model for the sequences of classes of 
segments and b) the membership probability of segments in classes. 

The probability distribution (a) is estimated as an n-gram model of classes of 
segments from the sequences of semantic units $v = v_1 v_2 \cdots v_n$, 
$v_i \in V$, $i = 1, \ldots, n$ of the training set. 

A language model is learnt from the set of segments of words $u_i$ of the 
training set associated to each class. These language models provide the 
membership probability of segments in classes (b). We have explored some 
approaches to automatically obtain these language models. In particular, we used 
n-gram models, the Prefix-Tree Acceptor (PTA)¹ estimated from the training set, 
and two Grammatical Inference (GI) techniques (MGGI and ECGI) which are 
described in Section 5. 

We define a function of segment classification $q : \Sigma^* \rightarrow 2^V$: 

$$q(s) = \{\, v_i \in V : p(s \mid v_i) > 0 \,\} \tag{8}$$

where $q(s)$ is the set of all the classes $v_i$ such that $p(s \mid v_i) > 0$. 
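
A small sketch of the classification function in (8), assuming that the 
membership probabilities are available as a table per class. The table values 
below are invented for illustration, and the class `ciudad_origen` is a 
hypothetical label used only to show a segment that belongs to two classes. 

```python
def classify_segment(segment, membership):
    """Return q(s): the set of classes v with p(s | v) > 0.

    `membership` maps each class to a dict {segment: p(segment | class)}.
    """
    return {v for v, table in membership.items() if table.get(segment, 0.0) > 0.0}


# Toy membership tables (illustrative values, not estimated from BASURDE).
membership = {
    "ciudad_destino": {("Barcelona",): 0.3, ("Valencia",): 0.2},
    "ciudad_origen": {("Barcelona",): 0.1},
    "marcador_destino": {("para",): 0.9},
}

print(classify_segment(("Barcelona",), membership))
print(classify_segment(("para",), membership))
```
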

4.2 The Semantic Segmentation 

Once the model that represents the sequences of semantic units and the 
probability distributions of membership for each class are learnt, the 
likelihood $\alpha(t)$ is obtained through: 

$$\alpha(t) = \sum_{i=1}^{l} \ \sum_{v \in q(w_{t-i+1}^{t})} \alpha_i^v(t) \tag{9}$$

Formula (9) is an adaptation of (5) for the case in which more than one class 
may be assigned to each segment (here $w_a^b$ denotes the segment 
$w_a \# \cdots \# w_b$). $\alpha(t)$ represents the accumulated likelihood taking 
into account all the segmentations of the first $t$ words of the sentence 
($w_1 \cdots w_t$), and $\alpha_i^v(t)$ represents the likelihood of all the 
segmentations of the first $t$ words, taking into account only those that end 
with a segment that is assigned to the class $v$ and has length $i$. It is 
calculated as follows: 

$$
\alpha_i^v(t) =
\begin{cases}
1 & : t = 0 \wedge i = 1 \\
0 & : t = 0 \wedge i > 1 \\
p\big(w_{t-i+1}^{t} \mid v\big) \cdot \sum_{r=1}^{l} \sum_{v'} \alpha_r^{v'}(t-i) \cdot p(v \mid v') & : 1 \le t \le |w| \wedge 1 \le i \le l
\end{cases}
\tag{10}
$$



The segmentation of maximum likelihood in terms of semantic units is obtained 
by using the Viterbi algorithm. Let $w = w_1 w_2 \cdots w_{|w|}$ be the sentence 
to analyze. The best segmentation is given by: 

$$(u, v) = \operatorname*{argmax}_{s \in S,\ v_r \in q(s_{(r)})} \mathcal{L}(w, s) \tag{11}$$

where $S$ is the set of all segmentations of $w$, and $q(s_{(r)})$ is the set of 
all classes that can be assigned to the $r$-th segment of the segmentation $s$. 
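
The search in (11) can be sketched as a Viterbi pass over (position, class) 
states for the bi-multigram case. This is a minimal illustration under our own 
assumptions, not the BASURDE implementation: the probability tables are invented 
toy values, `<s>` is a start symbol we introduce for the first class, and the 
function name is ours. 

```python
def viterbi_segmentation(words, class_bigram, membership, max_len, start="<s>"):
    """Return the most likely list of (segment, class) pairs, or None.

    class_bigram: dict {(previous_class, class): p(class | previous_class)}
    membership:   dict {class: {segment_tuple: p(segment | class)}}
    """
    n = len(words)
    # best[t][v] = (score, (end of previous segment, previous class))
    best = [dict() for _ in range(n + 1)]
    best[0][start] = (1.0, None)
    for t in range(1, n + 1):
        for i in range(1, min(max_len, t) + 1):
            segment = tuple(words[t - i:t])
            for v, table in membership.items():
                p_seg = table.get(segment, 0.0)
                if p_seg == 0.0:
                    continue  # v is not in q(segment)
                for v_prev, (score_prev, _) in best[t - i].items():
                    score = score_prev * class_bigram.get((v_prev, v), 0.0) * p_seg
                    if score > best[t].get(v, (0.0, None))[0]:
                        best[t][v] = (score, (t - i, v_prev))
    if not best[n]:
        return None  # no segmentation has non-zero likelihood
    v = max(best[n], key=lambda c: best[n][c][0])
    pairs, t = [], n
    while t > 0:  # follow the back-pointers to the start of the sentence
        _, (t_prev, v_prev) = best[t][v]
        pairs.append((" ".join(words[t_prev:t]), v))
        t, v = t_prev, v_prev
    return pairs[::-1]


# Toy model for the example sentence of Section 3 (illustrative values only).
toy_membership = {
    "consulta": {("me", "podria", "decir"): 0.5},
    "<hora_salida>": {("los", "horarios", "de", "trenes"): 0.4, ("horarios",): 0.2},
    "marcador_destino": {("para",): 0.9},
    "ciudad_destino": {("Barcelona",): 0.3},
}
toy_bigram = {
    ("<s>", "consulta"): 0.8,
    ("consulta", "<hora_salida>"): 0.6,
    ("<hora_salida>", "marcador_destino"): 0.7,
    ("marcador_destino", "ciudad_destino"): 0.9,
}
sentence = "me podria decir los horarios de trenes para Barcelona".split()
for segment, unit in viterbi_segmentation(sentence, toy_bigram, toy_membership, 4):
    print(unit, "<-", segment)
```

Because the class bigram only looks at the previous class, the state does not 
need to remember the length of the previous segment. 
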



5 The Two GI Techniques: The MGGI Methodology and 
the ECGI Algorithm 

As we mentioned above, we used two Grammatical Inference techniques in order 
to obtain the language models which represent the membership probability of 
segments in classes. These techniques are the Morphic Generator Grammatical 
Inference (MGGI) methodology [11] [12] and the Error Correcting Grammatical 
Inference (ECGI) algorithm [13] [14]. In this section, we briefly describe these 
two GI techniques. 

¹ The finite state automaton that only accepts the strings in the training set. 

5.1 MGGI 

The Morphic Generator Grammatical Inference (MGGI) methodology is a gram- 
matical inference technique that allows us to obtain a certain variety of regular 
languages. The application of this methodology implies the definition of a re- 
naming function, that is, each symbol of each input sample is renamed following 
a given function g. Different definitions of the function g will produce different 
models (stochastic regular automata). 

Let $R$ be a sample over the alphabet $\Sigma$. Let $\Sigma'$ be a finite 
alphabet. Let $h$ be a letter-to-letter morphism, $h : \Sigma'^* \rightarrow 
\Sigma^*$, and $g$ a renaming function, $g : R \rightarrow \Sigma'^*$. The regular 
language $L$ generated by the MGGI-inferred grammar $G$ is related to $R$ through 
the expression $L = h(l(g(R)))$, where $l(g(R))$ is inferred from $g(R)$ through 
the 2-Testable in the Strict Sense (2-TSS) inference algorithm [15]. 

Next, we describe the MGGI methodology through an example. In order 
to better explain its performance, we show the 2-TSS model estimated from a 
training set and the finite automaton estimated through the MGGI methodology 
from the same training set: 

Let $\Sigma = \{a, b\}$ be an alphabet and let $R = \{aabaa, abba, abbbba, 
aabbbba\}$ be a training set over $\Sigma$. 

a) 2-Testable in the Strict Sense inference algorithm (its stochastic version is 
equivalent to bigrams): 

Fig. 1. The finite automaton inferred from R by the 2-Testable in the Strict Sense 
algorithm 

The language accepted by the automaton of Figure 1 is: $L(A1) = a + a(b + 
a)^* a$. That is, the strings in $L(A1)$ begin and end with the symbol $a$ and 
contain any segment over the alphabet $\Sigma$. For example, $a \in L(A1)$, 
$aaa \in L(A1)$, $ababa \in L(A1)$, $abbba \in L(A1)$, etc. 

b) MGGI algorithm: 

The renaming function $g : \Sigma^* \rightarrow \Sigma'^*$ is defined in this 
example as the relative position, considering that each string is divided into 2 
intervals. This definition allows distinguishing between the first and the second 
parts of the strings. 

$g(R) = \{a_1a_1b_1a_2a_2,\ a_1b_1b_2a_2,\ a_1b_1b_1b_2b_2a_2,\ a_1a_1b_1b_1b_2b_2a_2\}$ 

Fig. 2. The finite automaton inferred from R by the MGGI algorithm, with a 
relative position renaming function 

The language accepted by the automaton in Figure 2 is: $L(A2) = a^+b^+a^+$. 
That is, the strings in $L(A2)$ contain a sequence of 1 or more symbols $a$, 
followed by a sequence of 1 or more symbols $b$, and followed by a sequence of 
1 or more symbols $a$. For example, $abbba \in L(A2)$ but $a \notin L(A2)$, 
$aaa \notin L(A2)$ and $ababa \notin L(A2)$. As can be observed from this 
example, the language inferred by the MGGI methodology is more similar to the 
strings in the training set than the one inferred by the 2-TSS algorithm. 
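
The example can be reproduced with a few lines of code. The sketch below 
represents a (non-stochastic) 2-TSS model by its sets of initial symbols, final 
symbols and allowed symbol pairs, and implements the relative-position renaming 
with two intervals; when a string has odd length the middle symbol is placed in 
the first interval, which matches $g(R)$ above. The function names and this 
representation are our own choices for illustration, not the authors' code. 

```python
def infer_2tss(sample):
    """2-TSS inference: collect initial symbols, final symbols and bigrams."""
    initials = {s[0] for s in sample}
    finals = {s[-1] for s in sample}
    pairs = {(x, y) for s in sample for x, y in zip(s, s[1:])}
    return initials, finals, pairs


def rename_relative(string):
    """Renaming function g: tag each symbol with its half (1 or 2)."""
    half = (len(string) + 1) // 2
    return tuple((ch, 1 if k < half else 2) for k, ch in enumerate(string))


def accepts(model, string, h=lambda symbol: symbol):
    """Is `string` in h(L(model))? Tracks every renamed symbol that could
    stand behind each position of the string."""
    initials, finals, pairs = model
    if not string:
        return False
    current = {y for y in initials if h(y) == string[0]}
    for ch in string[1:]:
        current = {b for (a, b) in pairs if a in current and h(b) == ch}
    return bool(current & finals)


R = ["aabaa", "abba", "abbbba", "aabbbba"]

a1 = infer_2tss(R)                                # Figure 1
a2 = infer_2tss([rename_relative(s) for s in R])  # Figure 2, before applying h


def drop_index(symbol):
    """Letter-to-letter morphism h: (letter, index) -> letter."""
    return symbol[0]


for x in ["a", "aaa", "ababa", "abbba"]:
    print(x, accepts(a1, x), accepts(a2, x, drop_index))
```
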

There is another interesting feature of the MGGI methodology. Given a task 
(a language to modelize), we can choose an adequate definition of a renaming 
function g. Different definitions of this function produce different models. 

5.2 ECGI 

The ECGI algorithm is a Grammatical Inference algorithm that infers a 
finite-state model incrementally and is based on error-correcting parsing. 
ECGI builds a finite-state automaton through the following incremental 
procedure: initially, a trivial automaton is built from the first training word. 
Then, for every new word that cannot be exactly recognized by the current 
automaton, the automaton is updated by adding to it those states and 
transitions which are required to accept the new word. To determine such states 
and transitions, error-correcting parsing is used to find the best path for the 
input word in the current automaton. The error rules defined for the analysis 
are: substitution, insertion and deletion of symbols. In this way, the 
error-correcting parsing finds the word in the current inferred language which 
is closest to the new word, according to the Levenshtein distance. To generate 
an automaton which is free of loops and circuits, some heuristic restrictions 
are imposed in the process of adding new states and transitions. Thus, the 
inference procedure attempts to model the duration and the position of the 
substructures that appear in the training data. Figure 3 shows the inference 
process of the ECGI algorithm using the same training set of the examples in 
Figures 1 and 2. 
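
To make the role of the three error rules concrete, the following sketch computes 
the Levenshtein distance between two strings by dynamic programming. It is a 
simplification: ECGI parses the new word against the whole current automaton, 
whereas this example only compares two individual strings. 

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion of ca
                current[j - 1] + 1,            # insertion of cb
                previous[j - 1] + (ca != cb),  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("aabaa", "abba"))   # 2: one deletion and one substitution
print(levenshtein("abba", "abbbba"))  # 2: two insertions
```
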



6 Experimental Results 

In order to evaluate the performance of the n-multigram models in a language 
understanding task, a set of experiments was conducted on the BASURDE [9] 
[10] dialog system, which answers queries about train timetables by telephone in 
Spanish. The corpus consisted of a set of 215 dialogs, obtained through a Wizard 
of Oz technique [16]. These dialogs contained 1,440 user turns, with 14,902 words 
and a vocabulary of 637 different words. A cross-validation procedure was used 
to evaluate the performance of our language understanding models. To this end, 
the experimental set was randomly split into five subsets of 288 turns. Our 
experiment consisted of five trials, each of which used a different subset as 
the test set, with the remaining 1,152 turns being used as the training set. 

(a) Initial automaton inferred from aabaa. 

(b) Automaton after analyzing the string abba. A substitution rule and a deletion 
rule have been applied to accept this string. A new state, q6, and some 
transitions have been added. 

(c) Automaton after analyzing the string abbbba. Two insertion rules have been 
used. Two new states, q7 and q8, and some transitions have been added. 

... new state, q9, and some transitions have been added. 

Fig. 3. The inference process of the ECGI algorithm using R 

We defined several measures to evaluate the accuracy of the strategy in both 
phases of the understanding process: 

- The percentage of correct sequences of semantic units (%cssu). 

- The percentage of correct semantic units (%csu). 

- The semantic precision (%Ps), which is the ratio between the number of 
correct proposed semantic units and the number of proposed semantic units. 

- The semantic recall (%Rs), which is the ratio between the number of correct 
proposed semantic units and the number of semantic units in the reference. 

- The percentage of correct frames (%cf), which is the percentage of resulting 
frames that are exactly the same as the corresponding reference frame. 






L. Hurtado et al. 



Table 1. Results of two language understanding models based on bi-multigrams: 
bi-multigram-BI and bi-multigram-TRI 

Semantic Segmentation    %cssu    %csu     %Ps     %Rs
bi-multigram-BI           68.6    87.8    91.0    91.3
bi-multigram-TRI          69.2    87.8    91.4    90.9

Frame Transduction         %cf    %cfs     %Pf     %Rf
bi-multigram-BI           78.5    85.7    88.9    91.3
bi-multigram-TRI          79.0    86.1    89.6    91.0



- The percentage of correct frame slots (frame name and its attributes) (%cfs). 

- The frame precision (%Pf), which is the ratio between the number of correct 
proposed frame slots and the number of proposed frame slots. 

- The frame recall (%Rf), which is the ratio between the number of correct 
proposed frame slots and the number of frame slots in the reference. 
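
As an illustration of the precision and recall measures, the sketch below scores 
a proposed sequence of units against a reference. The text does not state how 
correct units are matched, so we assume here, purely for the example, that they 
are counted through a longest-common-subsequence alignment; the function names 
are ours. 

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def precision_recall(proposed, reference):
    """Precision = correct / proposed, recall = correct / reference
    (correct units counted by LCS alignment, an assumption of this sketch)."""
    correct = lcs_length(proposed, reference)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


reference = ["consulta", "<hora_salida>", "marcador_destino", "ciudad_destino"]
proposed = ["consulta", "<hora_salida>", "ciudad_destino"]
print(precision_recall(proposed, reference))  # (1.0, 0.75)
```
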



A first set of experiments was performed using the orthographic transcription 
of user turns. In these experiments, different language understanding models 
based on multigrams were used. In all the experiments, the sequences of segments 
(semantic units) were modelled by bi-multigram probabilities. In other words, 
the probability that a segment of words representing a semantic unit appears 
depends on the previous segment (the term 
$p(C_{q(s_{(r)})} \mid C_{q(s_{(r-n+1)})} \cdots C_{q(s_{(r-1)})})$ in formula (3)). 
The experiments differed in which classification function (of the segments in 
classes) was chosen (the term $p(s_{(r)} \mid C_{q(s_{(r)})})$ in formula (3)). 
Table 1 shows the results obtained using bigrams and trigrams of words to assign 
membership probabilities of the segments to each class. 

Table 2 shows the results obtained by using the PTA, the MGGI and the 
ECGI techniques (described in Section 5). These techniques obtain stochastic 
automata that represent the segments of words that can be associated to each 
class. These automata model the training samples and supply the membership 
probabilities associated to the segments. In order to increase the coverage of 
these models, we used a smoothing method for stochastic finite automata, which 
was recently proposed in [17]. 

The results for all techniques are similar. Nevertheless, the models obtained 
by GI techniques slightly outperform the n-gram models; this could be explained 
by the fact that the GI techniques better represent the structure of the segments 
of the training set. 

Table 2. Results for the language understanding models: bi-multigram-PTA, 
bi-multigram-MGGI and bi-multigram-ECGI 

Semantic Segmentation    %cssu    %csu     %Ps     %Rs
bi-multigram-PTA          69.0    88.0    91.5    91.1
bi-multigram-MGGI         69.3    88.3    91.6    91.3
bi-multigram-ECGI         68.9    88.3    91.4    91.8

Frame Transduction         %cf    %cfs     %Pf     %Rf
bi-multigram-PTA          78.5    86.0    89.3    91.0
bi-multigram-MGGI         79.0    86.3    89.6    91.4
bi-multigram-ECGI         80.0    87.2    90.1    92.5

In order to study the performance of the proposed models in real situations, 
we also performed a second set of experiments using the recognized utterances 
supplied by a speech recognizer from the same set of dialogs. The recognizer 
[9] used Hidden Markov Models as acoustic models and bigrams as the language 
model; its Word Accuracy for the BASURDE corpus was 80.7%. In these 
experiments, the models were the same as before; that is, they were estimated 
from the orthographic transcription of the sentences, but the test was done with 
the recognized sentences. Table 3 shows the results for the recognized dialogs, 
using the same understanding models based on n-multigrams as in the previous 
experiments (Tables 1 and 2). 

Table 3. Results for the understanding models based on bi-multigrams, with 
different models of membership probabilities, using the output of the recognizer 

Semantic Segmentation    %cssu    %csu     %Ps     %Rs
bi-multigram-BI           44.3    74.3    81.6    82.5
bi-multigram-TRI          44.3    74.4    82.1    82.0
bi-multigram-PTA          44.3    74.4    81.7    82.4
bi-multigram-MGGI         44.4    74.6    81.8    82.7
bi-multigram-ECGI         44.7    74.9    81.9    83.2

Frame Transduction         %cf    %cfs     %Pf     %Rf
bi-multigram-BI           54.9    69.6    77.6    81.7
bi-multigram-TRI          55.0    69.8    78.1    81.4
bi-multigram-PTA          55.0    69.3    77.5    81.5
bi-multigram-MGGI         55.4    70.1    78.1    81.8
bi-multigram-ECGI         56.4    70.8    78.7    82.5

The results show a general reduction in the performance of the understanding 
system compared with the results for transcribed utterances. This reduction 
is especially significant (over 20 percentage points) for %cssu and %cf, which 
respectively measure the percentage of sentences that are segmented perfectly 
and the percentage of sentences that are perfectly translated to their 
corresponding frames. Although less than 42% of the sentences are correctly 
recognized, the understanding system bi-multigram-ECGI is able to correctly 
understand 56% of the sentences. For recall and precision, the reduction is 
much lower, around 10 percentage points in most cases. The best model was the 
bi-multigram-ECGI. It obtained a Precision (%Pf) and a Recall (%Rf) of 78.7% 
and 82.5%, respectively, at frame level. 



7 Conclusions 

We have presented an approach to language understanding that is based on 
multigrams. In our language understanding system, segments of words represent 
semantic information. Therefore, considering a language modelization based on 
variable length segments is an appropriate approach. We have proposed different 
methods to assign segments to semantic units and we have applied our language 
understanding system based on multigrams to a dialog task. Our results are 
similar to those obtained by other approaches [10] and show that the proposed 
methodology is appropriate for the task. This modelization has the advantage 
that different models can be used to represent the segments associated to the 
semantic units; even different modelizations can be applied to different semantic 
units. 

We think that the understanding system could be improved by adding 
information about the relevant keywords for each class in the classification 
function. It would also be interesting to develop techniques to automatically 
obtain the set of classes of segments, that is, the set of semantic units, which 
in our system are manually defined. 



Abstract. Traditional approaches to pattern recognition tasks normally consider 
only the unilabel classification problem, that is, each observation (both in the 
training and test sets) has one unique class label associated to it. Yet in many 
real-world tasks this is only a rough approximation, as one sample can be labeled 
with a set of classes and thus techniques for the more general multi-label 
problem have to be explored. In this paper we review the techniques presented in 
our previous work and discuss their application to the field of text 
classification, using the multinomial (Naive Bayes) classifier. Results are 
presented on the Reuters-21578 dataset, and our proposed approach obtains 
satisfactory results. 



1 Introduction 

Traditional approaches to pattern recognition tasks normally consider only the unilabel 
classification problem, that is, each observation (both in the training and test sets) has 
one unique class label associated to it. Yet in many real-world tasks this is only a rough 
approximation, as one sample can be labeled with a set of classes and thus techniques 
for the more general multi-label problem have to be explored. In particular, for multi- 
labeled documents, text classification is the problem of assigning a text document into 
one or more topic categories or classes [1]. There are many ways to deal with this 
problem. Most of them involve learning a number of different binary classifiers and 
using the outputs of those classifiers to determine the label or labels of a new 
sample [2]. We explore this approach using a multinomial (Naive Bayes) classifier 
and results are presented on the Reuters-21578 dataset. Furthermore, we examine 
the finding that an accumulated posterior probability approach to multi-label 
text classification performs favorably compared to the more standard binary 
approach to multi-label classification. 
The methods we discuss in this paper were applied to the classification phase of a 
dialogue system using neural networks [3], but the simplicity of the methods allows 
us to easily extend the same ideas to other application areas and other types of 
classifiers, such as the multinomial Naive Bayes classifier considered in this work 
for text classification. 

2 Unilabel and Multi-label Classification Problems 

Unilabel classification problems involve finding a definition for an unknown 
function $k^*(x)$ whose range is a discrete set containing $|\mathcal{C}|$ values 
(i.e., $|\mathcal{C}|$ “classes” of the set of classes 
$\mathcal{C} = \{c^{(1)}, c^{(2)}, \ldots, c^{(|\mathcal{C}|)}\}$). The definition 
is acquired by studying collections of training samples of the form 

$$\{(x_n, c_n)\}_{n=1}^{N}, \quad c_n \in \mathcal{C}, \tag{1}$$

where $x_n$ is the $n$-th sample and $c_n$ is its corresponding class label. 

For example, in handwritten digit recognition, the function $k^*$ maps each 
handwritten digit to one of $|\mathcal{C}| = 10$ classes. The Bayes decision rule 
for minimizing the probability of error is to assign the class with maximum a 
posteriori probability to the sample $x$: 

$$k^*(x) = \operatorname*{argmax}_{k \in \mathcal{C}} \Pr(k \mid x) . \tag{2}$$

In contrast to the unilabel classification problem, in other real-world learning 
tasks the unknown function $k^*$ can take more than one value from the set of 
classes $\mathcal{C}$. For example, in many important document classification 
tasks, like the Reuters-21578 corpus we will consider in Section 4, documents may 
each be associated with multiple class labels [1, 4]. In this case, the training 
set is composed of pairs of the form 

$$\{(x_n, C_n)\}_{n=1}^{N}, \quad C_n \subseteq \mathcal{C} . \tag{3}$$

Note that the unilabel classification problem is a special case in which 
$|C_n| = 1$ for all samples. 

There are two common approaches to this problem of classification of objects as- 
sociated with multiple class labels. The first is to use specialized solutions like the 
accumulated posterior probability approach described in the next section. The second is 
to build a binary classifier for each class as explained afterwards. 

Note that in certain practical situations, the amount of possible multiple labels is 
limited due to the nature of the task and this can lead to a simplification of the problem. For 
instance, if we know that the only possible appearing multiple labels can be 
and we do not need to consider all the possible combinations of the initial 

labels. In such situations we can handle this task as a unilabel classification 
problem with an extended set of labels defined as a subset of the power set 
$\mathcal{P}(\mathcal{C})$. Whether this method can be reliably used is highly 
task-dependent. 

2.1 Accumulated Posterior Probability 

In a traditional (unilabel) classification system, given an estimation of the a posteriori
probabilities $\Pr(k|x)$, we can think of a classification as "better estimated" if the
probability of the destination class is above some threshold (i.e., the classification of a sample
$x$ as belonging to class $k$ is better estimated if $\Pr(k|x) = 0.9$ than if it is only 0.4). A
generalization of this principle can be applied to the multi-label classification problem.

We can consider that we have correctly classified a sample only if the sum of
the a posteriori probabilities of the assigned classes is above some threshold $T$. Let
us define this concept more formally. Suppose we have an ordering (permutation)
$(k^{(1)}, \ldots, k^{(|C|)})$ of the set $C$ for a sample $x$, such that

$\Pr(k^{(i)}|x) \ge \Pr(k^{(i+1)}|x) \quad \forall\, 1 \le i < |C|$. (4)

We define the accumulated posterior probability for the sample $x$ as

$\Pr_j(x) = \sum_{i=1}^{j} \Pr(k^{(i)}|x), \quad 1 \le j \le |C|$. (5)

Using the above equation, we classify the sample $x$ into $n$ classes, where $n$ is the smallest
number such that

$\Pr_n(x) \ge T$, (6)

where the threshold $T$ must also be learned automatically in the training process. Then
the set of classification labels for the sample $x$ is simply

$K^*(x) = \{k^{(1)}, \ldots, k^{(n)}\}$. (7)
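The decision rule of equations (4) to (7) can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name, the dictionary representation of the posteriors and the example class names are assumptions.

```python
def accumulated_posterior_labels(posteriors, threshold):
    """Return the shortest prefix of the ranked classes whose posteriors sum to >= threshold.

    posteriors: dict mapping class label -> Pr(k|x) (hypothetical representation).
    """
    # Eq. (4): order the classes by decreasing posterior probability.
    ranked = sorted(posteriors.items(), key=lambda kv: kv[1], reverse=True)
    labels, total = [], 0.0
    for label, prob in ranked:
        labels.append(label)
        total += prob              # Eq. (5): accumulated posterior Pr_j(x)
        if total >= threshold:     # Eq. (6): stop at the smallest n reaching T
            break
    return labels                  # Eq. (7): the n best-ranked classes


example = {"grain": 0.55, "wheat": 0.30, "corn": 0.10, "trade": 0.05}
print(accumulated_posterior_labels(example, threshold=0.8))
```

Note that the rule always outputs at least one label, whatever the threshold.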



2.2 Binary Classifiers 

Another possibility is to treat each class as a separate binary classification problem
(as in [5, 6, 7]). Each such problem answers the question of whether a sample should be
assigned to a particular class or not.

For $\mathcal{C} \subseteq C$, let us define $\mathcal{C}[c]$ for $c \in C$ to be

$\mathcal{C}[c] = \begin{cases} \text{true}, & \text{if } c \in \mathcal{C}; \\ \text{false}, & \text{if } c \notin \mathcal{C}. \end{cases}$ (8)



A natural reduction of the multi-label classification problem is to map each
multi-labeled sample $(x, \mathcal{C})$ to $|C|$ binary-labeled samples of the form $((x, c), \mathcal{C}[c])$ for all
$c \in C$; that is, each sample is formally a pair $(x, c)$ together with the associated binary label $\mathcal{C}[c]$.
In other words, we can think of each observed class set $\mathcal{C}$ as specifying $|C|$ binary labels
(depending on whether or not a class $c$ is included in $\mathcal{C}$), and we can then apply unilabel
classification to this new problem. For instance, if a given training pair $(x, \mathcal{C})$ is labeled
with the classes $c^{(i)}$ and $c^{(j)}$, i.e. $(x, \{c^{(i)}, c^{(j)}\})$, then $|C|$ binary-labeled samples are defined
as $((x, c^{(i)}), \text{true})$, $((x, c^{(j)}), \text{true})$ and $((x, c), \text{false})$ for the rest of the classes $c \in C$.

Then a set of binary classifiers is trained, one for each class. The $i$-th classifier is
trained to discriminate between the $i$-th class and the rest of the classes, and the resulting
classification rule is

$K^*(x) = \{k \in C \mid \Pr(k|x) \ge T\}$, (9)

where $T$ is a threshold which must also be learned. Note that in the standard binary
classification problem $T = 0.5$, but experiments have shown that better results are obtained if
we allow the more general formulation of equation (9). We can also generalize further
by estimating a different threshold $T_c$ for each class, but this would increase the
number of parameters to estimate, and the approach with only one threshold
often works well in practice.
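As a minimal sketch of this reduction and of the decision rule (9), assuming label sets are Python sets and that each binary classifier exposes its posterior for the "true" label (all names below are hypothetical):

```python
def to_binary_samples(x, labels, classes):
    """Map one multi-labeled sample (x, labels) to |C| binary-labeled samples ((x, c), C[c])."""
    return [((x, c), c in labels) for c in classes]


def binary_decision(posteriors, threshold=0.5):
    """Eq. (9): keep every class whose binary classifier gives Pr(k|x) >= threshold.

    posteriors: dict class -> posterior of that class's own binary classifier,
    so the values need not sum to one.
    """
    return {k for k, p in posteriors.items() if p >= threshold}


samples = to_binary_samples("doc-1", {"grain", "wheat"}, ["grain", "wheat", "trade"])
print(samples)
print(binary_decision({"grain": 0.7, "wheat": 0.4, "trade": 0.1}, threshold=0.3))
```

Unlike the accumulated posterior probability rule, this rule may return an empty label set when no posterior reaches the threshold.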

3 The Multinomial Model 

As an application of the multi-label classification rules we will consider a text classification
task, where each document is assigned a $W$-dimensional vector of word counts,
where $W$ is the size of the vocabulary. This representation is known as "bag-of-words".
As classification model we use the Naive Bayes text classifier in its multinomial event
model instantiation [8]. In this model, we assume that the probability
of each event (word occurrence) is independent of the word's context and of its position
in the document in which it appears, and thus the chosen representation is justified. Given the
representation of a document by its counts $x = (x_1, \ldots, x_W)^t$, the class-conditional
probability is given by the multinomial distribution

$p(x|c) = p(x_+|c)\, p(x|c, x_+) = p(x_+|c)\, \frac{x_+!}{\prod_w x_w!} \prod_w p(w|c, x_+)^{x_w}$, (10)

where $w = 1, \ldots, W$ denotes the word variable, $x_+ = \sum_w x_w$ is the length of
document $x$, and $p(w|c, x_+)$ are the parameters of the distribution, with the restriction

$\sum_w p(w|c, x_+) = 1 \quad \forall c, x_+$. (11)

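In order to reduce the number of parameters to estimate we assume that the
distribution parameters are independent of the length $x_+$, and thus $p(w|c, x_+) = p(w|c)$, and
that the length distribution is independent of the class $c$, so (10) becomes

$p(x|c) = p(x_+)\, \frac{x_+!}{\prod_w x_w!} \prod_w p(w|c)^{x_w}$. (12)

Applying Bayes rule we obtain the unilabel classification rule

$\begin{aligned} k^*(x) &= \operatorname{argmax}_{c \in C} \{ p(c|x) \} \\ &= \operatorname{argmax}_{c \in C} \{ \log \big( p(c)\, p(x|c) \big) \} \\ &= \operatorname{argmax}_{c \in C} \Big\{ \log p(c) + \sum_w x_w \log p(w|c) \Big\}. \end{aligned}$ (13)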

The multi-label classification rules can be adapted accordingly. 
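To make rule (13) concrete, the following sketch scores a document under toy parameters. The class names, vocabulary and probabilities are invented for illustration, and the dictionary layout is an assumption rather than the authors' data structure.

```python
import math


def log_joint_scores(counts, log_prior, log_cond):
    """Per-class score log p(c) + sum_w x_w log p(w|c), as in the last line of Eq. (13).

    counts:    dict word -> count x_w for one document
    log_prior: dict class -> log p(c)
    log_cond:  dict class -> dict word -> log p(w|c)
    Terms that do not depend on the class (length prior, multinomial coefficient) are dropped.
    """
    return {
        c: log_prior[c] + sum(n * log_cond[c][w] for w, n in counts.items())
        for c in log_prior
    }


def classify_unilabel(counts, log_prior, log_cond):
    scores = log_joint_scores(counts, log_prior, log_cond)
    return max(scores, key=scores.get)


# Toy parameters (hypothetical values).
log_prior = {"grain": math.log(0.5), "trade": math.log(0.5)}
log_cond = {
    "grain": {"wheat": math.log(0.8), "tariff": math.log(0.2)},
    "trade": {"wheat": math.log(0.3), "tariff": math.log(0.7)},
}
print(classify_unilabel({"wheat": 3, "tariff": 1}, log_prior, log_cond))
```

Working with sums of logarithms instead of products of probabilities is what keeps the computation numerically safe for long documents.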

To estimate the prior probabilities $p(c)$ of the classes and the parameters $p(w|c)$ we
apply the maximum-likelihood method. In the training phase we replicate the
multi-labeled samples, that is, we transform the training set $\{(x_n, \mathcal{C}_n)\}_{n=1}^{N}$, $\mathcal{C}_n \subseteq C$, into the
"unilabel" training set








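$M\big(\{(x_n, \mathcal{C}_n)\}_{n=1}^{N}\big) = \bigcup_{n=1}^{N} \bigcup_{c \in \mathcal{C}_n} \{(x_n, c)\} =: \{(\tilde{x}_n, \tilde{c}_n)\}_{n=1}^{\tilde{N}}$. (14)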

The log-likelihood function of this training set is then

$\log L(\{p(c)\}, \{p(w|c)\}) = \sum_{n=1}^{\tilde{N}} \Big( \log p(\tilde{c}_n) + \sum_w \tilde{x}_{nw} \log p(w|\tilde{c}_n) + \mathrm{const}(\{p(c)\}, \{p(w|c)\}) \Big)$. (15)

Using Lagrange multipliers we maximize this function under the constraints

$\sum_c p(c) = 1$ and $\sum_w p(w|c) = 1, \quad \forall\, 1 \le c \le |C|$. (16)

The resulting estimators¹ are the relative frequencies

$\hat{p}(c) = \frac{N_c}{\sum_{c'} N_{c'}}$ (17)

and

$\hat{p}(w|c) = \frac{N_{cw}}{\sum_{w'} N_{cw'}}$, (18)

where $N_c = \sum_n \delta(\tilde{c}_n, c)$ is the number of documents of class $c$ and, similarly,
$N_{cw} = \sum_n \delta(\tilde{c}_n, c)\, \tilde{x}_{nw}$ is the total number of occurrences of word $w$ in all the documents of
class $c$. In these equations $\delta(\cdot, \cdot)$ denotes the Kronecker delta function, which is equal to
one if both its arguments are equal and zero otherwise.
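The replication step (14) and the relative-frequency estimators (17) and (18) can be sketched as follows. The representation of a document as a word-count dictionary and the function names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def replicate(samples):
    """Eq. (14): turn each multi-labeled sample (x, labels) into one unilabel sample per label."""
    return [(x, c) for x, labels in samples for c in labels]


def relative_frequencies(samples):
    """Unsmoothed maximum-likelihood estimates of p(c) and p(w|c) (Eqs. 17 and 18)."""
    unilabel = replicate(samples)
    n_c = Counter(c for _, c in unilabel)   # N_c: replicated documents per class
    n_cw = defaultdict(Counter)             # N_cw: word occurrences per class
    for x, c in unilabel:
        n_cw[c].update(x)                   # adds the counts of document x
    total = sum(n_c.values())
    prior = {c: n / total for c, n in n_c.items()}
    cond = {
        c: {w: n / sum(counts.values()) for w, n in counts.items()}
        for c, counts in n_cw.items()
    }
    return prior, cond


docs = [
    ({"wheat": 2, "price": 1}, {"grain", "trade"}),  # a multi-labeled document
    ({"tariff": 3}, {"trade"}),
]
prior, cond = relative_frequencies(docs)
print(prior)
print(cond["trade"])
```

A document with two labels contributes its counts to both classes, which is exactly the effect of the replication.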



3.1 Smoothing 

Parameter smoothing is required to counteract the effect of statistical variability of the
training data, particularly when the number of parameters to estimate is relatively large
in comparison with the amount of available data. As smoothing method we will use
unigram interpolation [9].

The basis of this method is known as absolute discounting, and it consists of gaining
"free" probability mass from the seen events by discounting a small constant $b$ from every
(positive) word count. The idea behind this model is to leave the high counts virtually
unchanged, with the justification that for a corpus of approximately the same size the
counts will not differ much, and we can consider an "average" value, using a non-integer
discounting. The gained probability mass for each class $c$ is



$M_c = \frac{b \cdot |\{w' : N_{cw'} > 0\}|}{\sum_{w'} N_{cw'}}$ (19)



¹ We will denote parameter estimations with the hat (^) symbol.

and is distributed according to a generalized distribution, in our case the unigram
distribution

$p(w) = \frac{\sum_c N_{cw}}{\sum_{w'} \sum_c N_{cw'}}$. (20)



The final estimation thus becomes

$\hat{p}(w|c) = \max\left\{0, \frac{N_{cw} - b}{\sum_{w'} N_{cw'}}\right\} + p(w)\, M_c$. (21)
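The smoothing of equations (19) to (21) can be sketched as below, assuming the counts $N_{cw}$ are held in nested dictionaries and that the vocabulary covers every word with a non-zero count; the function name and the toy counts are hypothetical.

```python
def smoothed_conditionals(n_cw, vocabulary, b):
    """Absolute discounting with unigram interpolation.

    n_cw: dict class -> dict word -> count N_cw; b: discounting constant with 0 < b < 1.
    Returns dict class -> dict word -> smoothed p(w|c).
    """
    grand_total = sum(sum(counts.values()) for counts in n_cw.values())
    # Eq. (20): unigram distribution pooled over all classes.
    unigram = {
        w: sum(counts.get(w, 0) for counts in n_cw.values()) / grand_total
        for w in vocabulary
    }
    smoothed = {}
    for c, counts in n_cw.items():
        class_total = sum(counts.values())
        seen = sum(1 for n in counts.values() if n > 0)
        freed_mass = b * seen / class_total          # Eq. (19): M_c
        smoothed[c] = {
            w: max(0.0, (counts.get(w, 0) - b) / class_total) + unigram[w] * freed_mass
            for w in vocabulary                      # Eq. (21)
        }
    return smoothed


counts = {"grain": {"wheat": 3, "price": 1}, "trade": {"tariff": 2, "price": 2}}
p = smoothed_conditionals(counts, ["wheat", "price", "tariff"], b=0.5)
print(p["grain"])
```

In this toy example the word "tariff", never seen in class "grain", receives a small non-zero probability, while each class distribution still sums to one.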



The selection of the discounting parameter $b$ is crucial for the performance of the
classifier. A possible way to estimate it is the so-called leaving-one-out technique.
This can be considered as an extension of the cross-validation method [10, 11]. The
main idea is to split the $N$ observations (documents) of the training corpus into $N - 1$
observations that serve as training part and only 1 observation, the so-called hold-out
part, that will constitute the simulated test set. This process is repeated $N$ times
in such a way that every observation eventually constitutes the hold-out set. The main
advantage of this method is that each observation is used for both the training and the
hold-out part, and thus we achieve an efficient exploitation of the given data. For the
actual parameter estimation we again use maximum likelihood. For further details the
reader is referred to [12].

No closed-form solution for the estimation of $b$ using leaving-one-out can be given.
Nevertheless, an interval for the value of this parameter can be explicitly calculated as

$\frac{n_1}{n_1 + 2 n_2 + \sum_{r \ge 3} n_r} < b < \frac{n_1}{n_1 + 2 n_2}$, (22)

where $n_r = \sum_w \delta\big(\sum_c N_{cw}, r\big)$ is the number of words that have been seen exactly
$r$ times in the training set. Since in general leaving-one-out tends to underestimate the
effect of unseen events, we choose to use the upper bound as the leaving-one-out estimate

$b_{l1o} = \frac{n_1}{n_1 + 2 n_2}$. (23)
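Assuming the estimate is the upper bound of (22), it only needs the counts of words seen exactly once and exactly twice. A small sketch follows; the function name and the input layout (total count per word, pooled over classes) are assumptions.

```python
from collections import Counter


def leaving_one_out_discount(word_totals):
    """Return n1 / (n1 + 2 * n2), where n_r is the number of words seen exactly r times.

    word_totals: dict word -> total count over all classes (sum_c N_cw).
    """
    count_of_counts = Counter(word_totals.values())
    n1, n2 = count_of_counts[1], count_of_counts[2]
    if n1 + 2 * n2 == 0:
        raise ValueError("no word occurs exactly once or twice; the estimate is undefined")
    return n1 / (n1 + 2 * n2)


# Two words seen once, one word seen twice, one word seen five times.
print(leaving_one_out_discount({"a": 1, "b": 1, "c": 2, "d": 5}))
```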



3.2 A Note About Implementation 



In the actual implementation of the multinomial classifier we cannot directly compute
the probabilities as given in equation (12), due to underflows in the computation of
the exponentiation of the multinomial parameters². In the unilabel classification tasks
(and therefore in the extension to binary classifiers) we avoid this problem by using the
joint probability in the maximization (see eq. (13)), but for the accumulated posterior
probability approach we have to work with real posterior probabilities in order to handle
the threshold in a correct way. A possibility to compute these probabilities in a numerically
stable way is to introduce a maximum operation in Bayes rule

$p(c|x) = \frac{p(x, c) \,/\, \max_{c''} p(x, c'')}{\sum_{c'} p(x, c') \,/\, \max_{c''} p(x, c'')}$ (24)

and then introduce a logarithm and an exponentiation function that allow us to compute
the probabilities in a reliable way

$p(c|x) = \frac{\exp\big(\log p(x, c) - \max_{c''} \log p(x, c'')\big)}{\sum_{c'} \exp\big(\log p(x, c') - \max_{c''} \log p(x, c'')\big)}$. (25)

² Note that the multinomial coefficient cancels when applying Bayes rule.
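Equation (25) translates directly into code. The sketch below assumes the log joint probabilities $\log p(x, c)$ are already available in a dictionary (a hypothetical layout) and shows a case where naive exponentiation would underflow.

```python
import math


def posteriors_from_log_joint(log_joint):
    """Eq. (25): subtract the largest log joint before exponentiating, then normalise.

    log_joint: dict class -> log p(x, c).
    """
    peak = max(log_joint.values())
    shifted = {c: math.exp(v - peak) for c, v in log_joint.items()}  # best class maps to 1.0
    norm = sum(shifted.values())
    return {c: s / norm for c, s in shifted.items()}


# math.exp(-1000.0) is 0.0 in double precision, so dividing the raw
# joint probabilities would fail; the shifted version is well defined.
print(posteriors_from_log_joint({"grain": -1000.0, "trade": -1001.0}))
```

Because the largest term becomes exp(0) = 1, the denominator is always at least one and the division is safe.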



4 Experimental Results 

4.1 The Dataset 

As corpus for our experiments we use Reuters-21578, a collection of articles that appeared
on the Reuters newswire in 1987. More precisely, we use the Modified Apte Split as
described in the original corpus, consisting of a training set of 9 603 documents and a
test set of 3 299 documents (the remaining 8 676 are not used). Although this partition
was originally intended to restrict the set of used documents to those with one or more
well-defined class labels (topics, as they are called in the documentation), problems with the
exact definition of what was meant by 'topic' resulted in documents without
associated class labels appearing in both the training and the test set. Statistics of the
corpus are shown in Table 1.



Table 1. Statistics for the Reuters-21578 dataset

                        Number of documents
            Total    No label         Unilabel         Multi-label
Training    9 603    1 828 (19.0%)    6 552 (68.3%)    1 223 (12.7%)
Test        3 299      280  (8.5%)    2 581 (78.2%)      438 (13.3%)



In spite of the explanation given in the "README" file accompanying the dataset,
we feel that the presence of unlabeled documents in the corpus is not adequate, as they
seem to be the result of incorrect labelling, and therefore they should be eliminated from
the test set. We report results with the whole set, however, in order to better compare
our results with those of other researchers. On the other hand, the presence of such documents in
the training set does provide some useful information and can be considered a "real
life" situation, where only a subset of the available data has been labeled. In our case
we use the unlabeled documents as an aid to better estimate the smoothing parameters,
but they can also be used in a more powerful way [7]. This will be the subject of further
research.

For the accumulated posterior probability approach, the presence of unlabeled
samples in the test set immediately represents a classification error, as the definition of the
approach requires at least one label to be detected. One possibility to avoid this
problem could be to include a "<no_class>" label, trained with the unlabeled samples
in the training set and mutually exclusive with the other classes. This seems,
however, an ad hoc solution that does not generalize well, so we decided not to
apply it. On the other hand, the binary classifiers can handle the case of unlabeled
samples in a natural way, if none of the posterior probabilities lies above the predefined
threshold.³

Fig. 1. Precision and recall curves for the Reuters-21578 dataset: (a) accumulated posterior
probability; (b) binary classifiers. Note the different scaling of the axes



³ In the "normal" case where each sample should be labeled, we could choose the class with
the highest probability as the one unique label if no probability is higher than the threshold.

4.2 Results

We will present several figures as measures of the effectiveness of our methods,
in order of increasing difficulty of the task. First we consider the simple unilabel
classification problem, that is, only the samples with one unique class label are considered.
We obtain an error rate of 8.56% in this case. If we include the non-labeled samples for
a better estimation of the smoothing parameters we do not get any improvement in the
error rate.

In addition to the error rate, in the multi-label classification problem we also consider
the precision/recall measure. It is worth noting that in most previous work the error
rate is not considered an appropriate measure of the effectiveness of a multi-label
classification system, as it does not take into consideration "near misses", that is, for
example, the case when all the detected labels are correct but there is still one label
missing. This is clearly an important issue, but for some applications, especially when
the classification system is only a part of a much bigger system (see for example [3]),
such a "small" error does have a great influence on the output of the whole system, as
it propagates into the subsequent constituents. Therefore we feel that the true error rate
should also be included in such a study.

In the case of multi-label classification, precision is defined as 



$\text{precision} = \frac{\#\ \text{of correct detected labels}}{\#\ \text{of detected labels}}$ (26)

and recall as

$\text{recall} = \frac{\#\ \text{of correct detected labels}}{\#\ \text{of reference labels}}$ (27)

where “detected labels” corresponds to the labels in the output of the classifier. 
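A small sketch of equations (26) and (27) follows. It assumes that the label counts are pooled over the whole test set before dividing, which the text does not state explicitly, and that reference and detected labels are given as parallel lists of sets (hypothetical names).

```python
def precision_recall(reference, detected):
    """Precision and recall with label counts pooled over all samples.

    reference, detected: parallel lists of label sets, one pair per sample.
    """
    correct = sum(len(r & d) for r, d in zip(reference, detected))
    n_detected = sum(len(d) for d in detected)
    n_reference = sum(len(r) for r in reference)
    precision = correct / n_detected if n_detected else 0.0
    recall = correct / n_reference if n_reference else 0.0
    return precision, recall


reference = [{"grain", "wheat"}, {"trade"}]
detected = [{"grain"}, {"trade", "money-fx"}]
print(precision_recall(reference, detected))
```

Here two of the three detected labels are correct and two of the three reference labels are found.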

The curves shown in Figure 1 are obtained by varying the threshold $T$ in the range
(0, 1). Note that because of this method for generating the curves the axis ranges are
quite different. We can observe that the accumulated posterior probability approach has
a much higher precision rate than the binary classifiers, which, in turn, have a higher
recall rate. That means that the accumulated posterior probability approach does a "safe"
classification, where the output labels have a high probability of being right, but it does
not find all the reference class labels. The binary classifiers, on the other hand, do find
most of the correct labels, but at the expense of also outputting a large number of incorrect
labels. The effect of (not) including the non-labeled test samples can also be seen in
the curves. As expected, the performance of the accumulated posterior probability
approach increases when leaving these samples out. In the case of the binary classifiers, the
difference is not as big, but better results are still obtained when using only the labeled
data.

It is also interesting to observe the evolution of the error rate when varying the
threshold value. For the multi-label problem, for a sample to be correctly classified, the
whole set of reference labels must be detected. That is, the number of detected labels
must be the same in the reference and the output of the classifier (and obviously the
labels also have to be the same). This is a rather strict measure, but one must consider
that in many systems the classification is only one step in a chain of processes and we
are interested in the exact performance of the classifier [3]. The error rate is shown in
Figure 2. Note that these curves show the error rate on the test set in order to analyze the
behavior of the classification methods. For a real-world classification system we should
choose an appropriate threshold value (for example by using a validation set) and then
use this value in order to obtain a figure for the error rate.
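Under this strict definition a sample counts as an error unless the detected label set equals the reference set exactly. A minimal sketch (the function name and the list-of-sets layout are assumptions):

```python
def exact_match_error_rate(reference, detected):
    """Fraction of samples whose detected label set differs from the reference label set."""
    if not reference:
        raise ValueError("empty test set")
    errors = sum(1 for r, d in zip(reference, detected) if set(r) != set(d))
    return errors / len(reference)


reference = [{"grain", "wheat"}, {"trade"}, set()]
detected = [{"grain"}, {"trade"}, set()]
print(exact_match_error_rate(reference, detected))
```

The first sample is a "near miss" (one label missing) and still counts as a full error, while the unlabeled third sample is correct only because nothing was detected for it.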

We see that when considering the error rate, the accumulated posterior probability
approach performs much better than the binary classifiers. For this approach the
threshold does not have a great influence on the error rate unless we use high values, where
an increase in the number of class labels the classifier has to include to reach the
threshold produces an increase in the error rate. Somewhat surprisingly, for binary
classifiers the best results are obtained for low threshold values. This is probably due to
the unclean division between the classes defined in every binary subproblem, which leads
to an incorrect parameter estimation. Taking into account the correlation between the
classes may help to alleviate the problem.




Fig. 2. Error rate for the two multi-label methods 



5 Conclusions 

In this paper we have discussed some possibilities to handle the multi-label classification
problem. The methods are quite general and can be applied to a wide range of statistical
classifiers. Results on text classification with the Reuters-21578 corpus have been
presented, where the accumulated posterior probability approach performs better than the
most widely used binary classifiers.

However, in these approaches we did not take the relation between the different 
classes into account. Modeling this information may provide a better estimation of 
the parameters and better results can be expected. For the Reuters-21578 corpus in 
particular, a better exploitation of the unlabeled data can also lead to an improvement in 
performance. 

Abstract. AsdeCopas is a module designed to interface syntax and
semantics. It is based on self-contained, hierarchically organised
semantic rules and outputs formulas in a flat language. This paper extends a
previous definition of semantic rules and describes two applications of
AsdeCopas, namely in question interpretation and in semantic
disambiguation.



1 Introduction 

In 1997, and in order to allow Natural Language access to a database of tourist 
resources, a linguistically motivated system called Edite [13, 23, 14] was created. 
Edite had a traditional syntax-semantic interface, where semantic rules were 
associated with syntactic rules and the semantic analysis was made by a bottom- 
up parser. Edite had a fast development, but it soon became saturated, mainly 
due to the dependency between syntactic and semantic rules. This experiment 
made us change our approach and invest in a more robust methodology. We 
found in the 5P Paradigm [4, 16, 5] the background we were looking for. The 
syntax-semantic interface presented in this paper reflects the effort of adapting 
to 5P demands. 

Firstly, our syntax-semantic interface builds a dependency structure from a
surface analysis [10]. Then, a module called AsdeCopas [8, 9, 11], based on
incremental, hierarchically organised semantic rules, generates logical formulas.
Although the generation of logical formulas was AsdeCopas' original goal, it can also
execute other tasks. In this paper we show how it can be used in question
interpretation and in semantic disambiguation.

This paper is organised as follows: in section 2 we present some related work;
in section 3 we describe our interface, focussing on the semantic rules and
AsdeCopas; in sections 4 and 5 we highlight the mentioned applications. Final remarks
and future directions can be found in section 6.

2 Related Work

Syntax-semantic interfaces such as the ones presented in [3] and [18] are
widespread. These interfaces are based on semantic rules associated with
syntactic rules, and semantic analysis is typically executed bottom-up.
In contrast, syntax-semantic interfaces based on dependencies are not so
easy to find. ExtrAns [21] is an answer extraction system that executes a syntax-semantic
interface over dependencies. According to [20], the current version of ExtrAns
uses either Link Grammar [15] or the Conexor FDG parser [17] to generate
dependencies. Then, different algorithms are implemented over each one of the
dependency structures. In the first case, the logical form is constructed by a
top-down procedure, starting at the head of the main dependency and following
dependencies. The algorithm is prepared to deal with a certain type of
dependencies, and whenever an unexpected link appears, a special recovery treatment
is applied. When describing the algorithm, the authors say that most of these
steps "... become very complex, sometimes involving recursive applications of the
algorithm" and also that "specific particularities of the dependency structures
returned by Link Grammar add complexity to this process" [20]. In the second
case, that is, after the Conexor FDG parsing, the bottom-up parser used has
three stages. In the first one (introspection), possible underspecified predicates
are associated with each word. Object predicates introduce their own arguments,
but other predicates remain incomplete until the second stage (extrospection).
During extrospection, arguments are filled by looking at the relation between
each word and its head. Sometimes dummy arguments need to be assigned
when the algorithm faces disconnected dependency structures. A third stage
(re-interpretation) is needed to re-analyse some logical constructs. According to
the authors, the algorithm cannot produce the correct argument structure for
long distance dependencies.

In our proposal, semantic rules are set apart from the algorithm. Thus, the
algorithm is independent of the dependency structures. As semantic rules are
sensitive to the (possibly non-local) syntactic context, long distance
dependencies cause no problem and we are able to make semantic rules return precise
semantic values that depend on the context. Moreover, each semantic rule
contains all the necessary information to calculate the corresponding formulas. As
a result, all words, independently of their category, are mapped into formulas
in one step. Additionally, there is no need to recursively apply the algorithm.
Finally, semantic rules are organised in a hierarchy. Therefore, instead of creating
dummy arguments when disconnected dependency structures are found, default
rules are triggered in these situations.

3 Our Proposal 

3.1 The Syntax-Semantic Interface 

The whole syntax-semantic interface is integrated in a system called Javali [7].
It has three modules:

— Algas and Ogre, which generate dependency structures from a surface analysis
[10] (from now on we call these dependency structures "graphs");

— AsdeCopas, which takes the graph as input and performs several tasks, such
as enriching the graph by adding labels to its arrows, calculating semantic
values, generating logical forms, etc.

Each graph node contains: (i) a word; (ii) the word's category; (iii) the word's
position in the text; (iv) extra information about the word (this field might be
empty).

Each graph arc maintains information about: (i) the position and the category
associated with the source node; (ii) the position and the category associated
with the target node; (iii) the arrow label (possibly undefined)¹.

As an example, given the question Quais os hoteis com piscina de Lisboa
(Which are the hotels with a swimming pool in Lisbon), the following graph
is obtained, where the interrogative pronoun (qu) Quais arrows the common
noun (cn) hoteis, which is also the target of the article os, the proper noun
(pn) Lisboa and the cn piscina. Moreover, the cn hoteis arrows itself and can
be seen as the head of the question. Both prepositions (prep) com and de arrow
the heads of the prepositional phrases they introduce, respectively piscina and
Lisboa.

node('Quais', qu, 9, _),       arc(qu, 9, cn, 11, _),
node('os', artd_p, 10, _),     arc(artd_p, 10, cn, 11, _),
node('hoteis', cn, 11, _),     arc(cn, 11, cn, 11, _),
node('com', prep, 12, _),      arc(prep, 12, cn, 13, _),
node('piscina', cn, 13, _),    arc(cn, 13, cn, 11, _),
node('de', prep, 14, _),       arc(prep, 14, pn, 15, _),
node('Lisboa', pn, 15, _),     arc(pn, 15, cn, 11, _).
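For readers more familiar with Python than with Prolog facts, the same graph can be mirrored with two small record types. This is only an illustrative sketch: the class names, field names and helper functions are assumptions and are not part of Javali or AsdeCopas.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    word: str
    category: str
    position: int
    extra: Optional[str] = None   # extra information, possibly empty


class Arc(NamedTuple):
    source_category: str
    source_position: int
    target_category: str
    target_position: int
    label: Optional[str] = None   # arrow label, possibly undefined


NODES = [
    Node("Quais", "qu", 9), Node("os", "artd_p", 10), Node("hoteis", "cn", 11),
    Node("com", "prep", 12), Node("piscina", "cn", 13), Node("de", "prep", 14),
    Node("Lisboa", "pn", 15),
]
ARCS = [
    Arc("qu", 9, "cn", 11), Arc("artd_p", 10, "cn", 11), Arc("cn", 11, "cn", 11),
    Arc("prep", 12, "cn", 13), Arc("cn", 13, "cn", 11), Arc("prep", 14, "pn", 15),
    Arc("pn", 15, "cn", 11),
]


def head_positions(arcs):
    """Positions of nodes that arrow themselves (the head of the question)."""
    return [a.source_position for a in arcs if a.source_position == a.target_position]


def dependents_of(position, arcs):
    """Positions of the nodes that arrow the node at the given position."""
    return sorted(
        a.source_position for a in arcs
        if a.target_position == position and a.source_position != position
    )


print(head_positions(ARCS), dependents_of(11, ARCS))
```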

3.2 Semantic Rules 

Incrementability and Specificity. Aït-Mokhtar [2] defines an incremental
rule as “a self-contained operation, whose result depends on the set of contextual 
restrictions stated in the rule itself. [...] If a sub-string matches the contextual 
restrictions, the corresponding operation applies without later backtracking”. 
Our semantic rules were designed having this property in mind. As a result, 
each semantic rule is divided into three parts: 

— the element or elements to transform; 

— the context of the elements to transform (it can be seen as a set of conditions 
that, being verified, indicate that the rule can be applied); 

— the output (specified by a set of functions that will transform the elements 
according to the chosen representation). 



¹ Within our applications, dependencies are unlabelled, and go from dependents
to the head. The motivation behind these structures came from the 5P
Paradigm.





L. Coheur et al. 



Moreover, we assume the following:

— if an element appears either as an element to transform or as part of the
rule context, the semantic rule has access to all the semantic information of
that element. That is, it knows its variables, handles, associated predicates,
unspecified semantic values, and so on;

— the functions that perform the transformation can only use information from
elements defined as elements to transform or that appear in the rule's
context.

As a result, semantic rules are similar to the rules used in [1] and they can
also be seen as incremental rules as previously defined.

Additionally, semantic rules are organised in a subsumption hierarchy.
Therefore, if a set of rules can be applied, only the rules that do not subsume other
rules, that is, the most specific rules, are triggered. This allows new,
more specific information to be added to the system without having to rewrite general rules.

Syntax. Let $W$ be a set of words, $C$ a set of category labels, $D$ a set of arrow
labels and $P$ a set of variables denoting positions. The symbol _ is used to
represent an underspecified value. Consider the following definitions:

— Element: elem(w, c, p) is an element, where: (i) $w \in \{\_\} \cup W$; (ii) $c \in \{\_\} \cup C$;
(iii) $p \in P$.

— Arrow: arrow(w1, c1, p1, w2, c2, p2, d, l) is an arrow, and
no_arrow(w1, c1, p1, w2, c2, p2, d, l) a non-existing arrow, where: (i) $w_1, w_2 \in \{\_\} \cup W$;
(ii) $c_1, c_2 \in C$ ($c_1$ and $c_2$ are, respectively, the arrow source
and target); (iii) $p_1, p_2 \in P$; (iv) $d \in \{\_\} \cup \{L, R\}$ ($d$ is the arrow orientation:
L if it goes from right to left, R from left to right); (v) $l \in \{\_\} \cup D$ ($l$ is a
possibly undefined arrow label).

— Semantic Rule: $[R_i]\ \Sigma : \Theta \mapsto \Gamma$ is a semantic rule, where: (i) $\Sigma$ is a possibly
empty set of elements (the elements to operate on); (ii) $\Theta$ is a possibly empty
set of existing and non-existing arrows (the rule's context); (iii) $\Gamma$ is a set of
transformation functions, which vary according to the chosen representation
language.

The semantic rules presented here are an extension of the semantic rules presented
in [11], where there is no position associated with elements and arrows have no
words in their arguments. By associating (variables over) positions with elements,
there is no need to use indexes to distinguish two different objects with the
same category. Introducing words in the arrows' arguments allows more
precise semantic rules to be defined.

Extra constraints on the syntax of semantic rules can be found in [8, 9].

For now, and because of the applications from sections 4 and 5, we introduce
the following transformation functions: (i) (Xi): returns the (default)
predicate of the element in position Xi; (ii) (Xi): returns a variable associated with
the element in position Xi.

Hierarchy of Rules. In the following, we define the subsumption relation
between semantic rules. This relation establishes the hierarchy of rules and it is
based on the subsumption relation between categories (notice that, although
we use labels to represent categories, each category is a set of attribute/value
pairs organised in a subsumption hierarchy). Due to the syntactic extensions of
semantic rules, the hierarchy relation defined below is also an extension of the
hierarchy relation presented in [11].

— Element subsumption: Given $e_1$ = elem(w1, c1, p1) and $e_2$ = elem(w2, c2, p2)
from $\Sigma$, $e_1$ subsumes $e_2$ ($e_1 \sqsubseteq_e e_2$) iff (i) $c_1 \sqsubseteq c_2$; (ii) $(w_1 \ne \_) \Rightarrow (w_2 = w_1)$.

— Arrow subsumption: Given $a_1$ = arrow(w1, c1, p1, w2, c2, p2, d1, l1) and
$a_2$ = arrow(w3, c3, p3, w4, c4, p4, d2, l2) from $\Theta$, $a_1$ subsumes $a_2$ ($a_1 \sqsubseteq_a a_2$)
iff (i) $(w_1 \ne \_) \Rightarrow (w_3 = w_1)$; (ii) $(w_2 \ne \_) \Rightarrow (w_4 = w_2)$; (iii) $c_1 \sqsubseteq c_3 \wedge c_2 \sqsubseteq c_4$;
(iv) $(d_1 \ne \_) \Rightarrow (d_2 = d_1)$; (v) $(l_1 \ne \_) \Rightarrow (l_2 = l_1)$.

— Subsumption of non-existing arrows: Given
$a_1$ = no_arrow(w1, c1, p1, w2, c2, p2, d1, l1) and
$a_2$ = no_arrow(w3, c3, p3, w4, c4, p4, d2, l2) from $\Theta$, $a_1$ subsumes $a_2$
($a_1 \sqsubseteq_a a_2$) iff (i) $(w_1 \ne \_) \Rightarrow (w_3 = w_1)$; (ii) $(w_2 \ne \_) \Rightarrow (w_4 = w_2)$;
(iii) $c_1 \sqsubseteq c_3 \wedge c_2 \sqsubseteq c_4$; (iv) $(d_1 \ne \_) \Rightarrow (d_2 = d_1)$; (v) $(l_1 \ne \_) \Rightarrow (l_2 = l_1)$.

— Rule subsumption: Given two semantic rules $R_1 = (\Sigma_1, \Theta_1, \Gamma_1)$ and
$R_2 = (\Sigma_2, \Theta_2, \Gamma_2)$, $R_1$ subsumes $R_2$ ($R_1 \sqsubseteq_r R_2$) iff (i) $(\forall e_1 \in \Sigma_1)(\exists e_2 \in \Sigma_2)(e_1 \sqsubseteq_e e_2)$;
(ii) $(\forall a_1 \in \Theta_1)(\exists a_2 \in \Theta_2)(a_1 \sqsubseteq_a a_2)$.

If Ri subsumes R2, R2 is said to be more specific than Ri. If both rules can 
apply, only the most specific one does so. 
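The subsumption checks can be illustrated with a short Python sketch. It is not the Prolog implementation of AsdeCopas: the tuple encoding of elements and arrows, the toy category hierarchy in `PARENT` and all function names are assumptions, and position variables are ignored for simplicity.

```python
WILD = "_"
# Hypothetical category hierarchy (child -> parent): "n" generalises "cn" and "pn".
PARENT = {"cn": "n", "pn": "n"}


def cat_subsumes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False


def field_ok(general, specific):
    """Encodes the condition (general != _) => (specific == general)."""
    return general == WILD or general == specific


def elem_subsumes(e1, e2):
    (w1, c1, _p1), (w2, c2, _p2) = e1, e2
    return cat_subsumes(c1, c2) and field_ok(w1, w2)


def arrow_subsumes(a1, a2):
    kind1, w1, c1, _p1, w2, c2, _p2, d1, l1 = a1
    kind2, w3, c3, _p3, w4, c4, _p4, d2, l2 = a2
    return (kind1 == kind2                      # arrow vs. no_arrow are compared separately
            and field_ok(w1, w3) and field_ok(w2, w4)
            and cat_subsumes(c1, c3) and cat_subsumes(c2, c4)
            and field_ok(d1, d2) and field_ok(l1, l2))


def rule_subsumes(r1, r2):
    (elems1, ctx1), (elems2, ctx2) = r1, r2
    return (all(any(elem_subsumes(e1, e2) for e2 in elems2) for e1 in elems1)
            and all(any(arrow_subsumes(a1, a2) for a2 in ctx2) for a1 in ctx1))


def most_specific(rules):
    """Names of the rules that do not strictly subsume another rule in `rules`."""
    return [
        name for name, rule in rules.items()
        if not any(other != name and rule_subsumes(rule, r) and not rule_subsumes(r, rule)
                   for other, r in rules.items())
    ]


# A generic proper-noun rule and a more specific one with an arrowing context.
R2 = ([("_", "pn", "X1")], [])
R3 = ([("_", "pn", "X1")],
      [("arrow", "_", "pn", "X1", "_", "cn", "X2", "_", "_"),
       ("no_arrow", "_", "prep", "X3", "_", "pn", "X1", "R", "_")])
print(most_specific({"R2": R2, "R3": R3}))
```

With these two rules the generic rule subsumes the specific one, so only the specific rule would be triggered when both apply.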

3.3 AsdeCopas 

AsdeCopas is implemented in Prolog. It goes through each graph node and: (i)
identifies the rules that can be applied; (ii) chooses the most specific rules; (iii)
triggers the most specific rules (notice that at each step more than one rule can
be triggered).

Within AsdeCopas we highlight the following:

— The order of rule application is not relevant, as the results remain the same
independently of it;

— AsdeCopas controls variable generation: instead of randomly generating
variables, each variable is indexed by the position that the related word
occupies in the text. By this we guarantee that each semantic rule "knows"
the variables associated with each element to transform or appearing in the
context².

As an example of AsdeCopas output, see the representation obtained for the
question Quais os hoteis com piscina de Lisboa:

R10: ?x11, R1: hoteis(x11), R7: com(x11, x13),
R1: piscina(x13), R7: de(x11, x15), R2: NAME(x15, Lisboa)

In the next section we highlight the rules responsible for this result.

² Although we have not yet used this possibility, this control of variables should allow
us to run different semantic processes at different times and merge the results at the
end.



4 Task 1: Question Interpretation 

In Task 1 AsdeCopas is used to generate logical forms from questions.
Quantification is ignored.

4.1 Semantic Rules 

In the following we present a simplified set of the semantic rules that we 
use in this task. Namely, we present rules associated with names, adjectives 
and interrogative pronouns. These rules illustrate how logical forms are con- 
structed. 

Let us assume that the representation of a cn is a unary formula. For 
example, the cn praia (beach) originates the formula praia(x). The next rule 
calculates the cn representation. Notice that this rule applies independently 
of the arrowing context of the cn. 

[R1] elem(_, cn, X1) : ∅ ↦ (X1)( (X1)) 

Consider now a pn. Each pn originates a binary formula, with predicate NAME 
as suggested in [18]. The first argument is the variable of the named entity and 
the second argument is its name. The variable used for the named entity 
depends on whether the pn arrows a verb (v) or a cn, because we want 
( ) to be represented as NAME(x, Ritz), tem(e, x, y), piscina(y) and 
( ) as hotel(x), NAME(x, Ritz), tem(e, x, y), piscina(y). 
The following rules cover both situations. R2 is a generic rule; R3 is more 
specific and is subsumed by R2. 

[R2] elem(_, pn, X1) : ∅ ↦ NAME( (X1), (X1)) 

[R3] elem(_, pn, X1) : {arrow(_, pn, X1, _, cn, X2, _, _), 
no_arrow(_, prep, X3, _, pn, X1, R, _)} ↦ {NAME( (X2), (X1))} 

Consider now the rules for adjectives (adj). We assume that an adjective 
may originate either a unary formula or a binary formula with predicate AM (as 
suggested in [18]). The first representation is the default representation; the 
second representation is used when the adj arrows a noun n (either a cn or a pn). 
The following rules cover these cases (and R5 is subsumed by R4). 

[R4] elem(_, adj, X1) : ∅ ↦ (X1)( (X1)) 

[R5] elem(_, adj, X1) : 
{arrow(_, adj, X1, _, n, X2, _, _), no_arrow(_, prep, X3, _, adj, X1, R, _)} ↦ 
{AM( (X2), (X1)), (X1)( (X1))} 

By applying these rules, the adj ( ) in ( ) is transformed into nudista(x), 
and the adj ( ) from ( ) into AM(x, y), bonita(y), where x is the 
variable associated with 

Consider now the rule for prepositions. We assume, as usual, that a 
preposition results in a binary formula, where the first argument is the head of the 
phrase to which the prepositional phrase is attached, and the second the target 
of the preposition’s arrow. The first argument can be an event (rule R6) or an 
entity (rule R7). The following rule 

[R6] elem(_, prep, X1) : {arrow(_, prep, X1, _, n, X2, R, _), 
arrow(_, n, X2, _, n, X3, _, _)} ↦ { (X1)( (X3), (X2))} 

transforms the preposition from ( ) into 
prep(x, y), where x is the variable associated with and y with 

The next rule covers the case of verbs: 

[R7] elem(_, prep, X1) : {arrow(_, prep, X1, _, n, X2, R, _), 
arrow(_, n, X2, _, v, X3, _, _)} ↦ { (X1)( (X3), (X2))} 

This rule creates the formula para(e, x) from the phrase 
( ), where e is the event associated with the verb vai and x with 

In order to represent the interrogative pronouns , , and 
(labelled qu), we use the formula ?x, where x is a variable representing the 
objects we are asking for. 

[R8] elem(_, qu, X1) : {arrow(_, qu, X1, _, n, X2, R, _)} ↦ {? (X2)} 

With rule R8, the qu from results in the formula ?x, where x 
is the variable associated with 

4.2 Paraphrases 

A special effort was made in this task to resolve some paraphrase relations. As an 
example, both phrases ( ) and ( ) result in a formula similar to: 

?x, hoteis(x), tem(e, x, y), piscina(y) 

Notice that in order to reach this result, we only looked into the particular 
syntactic conditions that make another verb behave as the verb ter. 

4.3 Evaluation 

Consider that: 

— a “system representation” (SR) is the set of formulas that the system 
suggests as a representation of the question; 

— a “correct representation” (CR) is the set of formulas representing a question, 
where the exact number of expected predicates are produced, and variables 
are in the correct places. 

For example, given the question 
( ), its CR is: 

?x759, roteiros(x759), AM(x759, x760), pedestres(x760) 
AM(x759, x761), sinalizados(x761) 
em(x759, x763), NAME(x763, Lisboa) 

Nevertheless, as the word sinalizados was not understood by the system, it 
generates the following SR: 

?x759, roteiros(x759), AM(x759, x760), pedestres(x760) 
em(x759, x763), NAME(x763, Lisboa) 

From the Edite system we inherited a corpus of 680 non-treated questions about 
tourist resources. A first evaluation over 30 questions is presented in [11]. 
In that evaluation, the AsdeCopas input was a set of graphs that did not necessarily 
represent the correct analysis. Moreover, since previous steps associated several 
graphs with each question, each question had more than one SR. In that 
evaluation we obtained a precision of 19% and a recall of 77%. Nevertheless, 
as incorrect graphs were accepted, we were not really evaluating AsdeCopas, but 
the whole system. Besides, we did not measure the distance between the obtained 
SR and the CR, which can be an extremely useful measure. For instance, in 
the previous example, most of the information from the CR is in the SR. 

Therefore, we propose here a new evaluation where the correct graph is chosen 
and new metrics are used. Notice that if only one graph is associated with each 
question then, as we admit underspecified semantic values, the current set of 
semantic rules yields only one (underspecified) SR. Consider that: (i) SF 
is a formula such that SF ∈ SR (system formula); (ii) CF is a formula such that 
CF ∈ SR ∩ CR (correct formula); (iii) NCF is the number of (correct) formulas 
within a CR (number of (correct) formulas in a correct representation). 

We use the following metrics for each SR: (i) precision: number of CF / number 
of SF; (ii) recall: number of CF / NCF. 
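
As a worked illustration of these two metrics, the following Python sketch computes them for the roteiros example given earlier in this section. Representing each formula as a plain string and the function name `precision_recall` are assumptions made for the example only.

```python
def precision_recall(sr, cr):
    """Precision = |SR ∩ CR| / |SR|, recall = |SR ∩ CR| / |CR|."""
    sr, cr = set(sr), set(cr)
    if not sr or not cr:
        raise ValueError("both representations must be non-empty")
    cf = sr & cr  # the correct formulas (CF)
    return len(cf) / len(sr), len(cf) / len(cr)


# Correct representation (CR) of the example question: 8 formulas.
cr = [
    "?x759", "roteiros(x759)", "AM(x759,x760)", "pedestres(x760)",
    "AM(x759,x761)", "sinalizados(x761)",
    "em(x759,x763)", "NAME(x763,Lisboa)",
]
# System representation (SR): the two formulas for "sinalizados" are missing.
sr = [
    "?x759", "roteiros(x759)", "AM(x759,x760)", "pedestres(x760)",
    "em(x759,x763)", "NAME(x763,Lisboa)",
]

precision, recall = precision_recall(sr, cr)
print(precision, recall)  # 1.0 0.75
```

Every formula the system produced is correct (precision 1), but two of the eight expected formulas are missing (recall 6/8).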

In an evaluation over 50 questions, we obtained 45 SR such that precision = 
recall = 100%. The remaining 5 SR were evaluated according to the metrics 
presented above; the results are shown in Table 1. 

From these results we can conclude that the SR is very close to the CR. That 
is, even if we are not able to calculate the exact result, we can calculate a set of 
formulas that is very similar to the desired result. 

Table 1. Evaluation results 

Question number   Precision   Recall 
23                0.86        1 
28                0.85        0.85 
37                1           0.92 
40                1           0.8 
42                0.86        0.75 


5 Task 2: Semantic Disambiguation 

5.1 The Problem 

Like the majority of words, the word qualquer (roughly, , ) can take several 
semantic values. Sometimes the syntactic context allows us to choose the correct 
semantic value, sometimes only semantic information leads to the correct value, 
and sometimes the correct semantic value is impossible to calculate. 

In this task, we try to identify subsets of the set of semantic values 
that qualquer might take, by analysing its syntactic context. Semantic rules, as 
defined in 3.2, will be used in this task. 

Additionally, as other quantifiers can also take the same semantic values 
as qualquer, we also try to identify these values and the associated syntactic 
contexts. The following examples illustrate the difficulty of the task. 

( 1 ) 

( - - ) 

This sentence is equivalent to 

( 2 ) , 3 , 

Nevertheless, (1) is also equivalent to 

( 3 ) 

but (2) is not equivalent to 

( 4 ) 

which is not grammatical. To complicate matters, (1), (2) and (3) are equivalent to 

( 5 ) 

But not to: 

( 6 ) . 4 , 

because (1), (2), (3) and (5) are true only if David does not know all the 
Tolkien books, and (6) is true if David does not know one book from Tolkien. 
Moreover, (1), (2), (3) and (5) are equivalent to: 

( 7 ) 

3 Notice that this sentence could also be interpreted as if David only knew one book 
from Tolkien. 

4 Comparing (1) and (6), todos and qualquer, which seem to have the same meaning, 
have different values in the same syntactic context. 

5.2 Semantic Values 

In the following examples we show some of the semantic values that qualquer 
can take (see [19] for a detailed discussion about the multiple interpretations of 
qualquer): 

— In ( ) it has a universal value (univ); 

— In ( ) it has an existential value (exist); 

— In ( ) it is an adjective, and it means something like “with no relevant 
characteristics in the class denoted by the noun it qualifies”. We will denote 
this semantic value as indiscriminate; 

— In ( , ) it has the same indiscriminate value. 



5.3 Semantic Rules 

In this section we show the kind of rules we have in [6] and [9], where we try to 
go as far as possible in the disambiguation process of , ( )? 
( ) and ( ), by using its syntactic context. First let us 
define a default rule. The default semantic value can be an underspecified value 
[22] representing all of the semantic values, or a default value, for example the 
universal value, since it is the most common. Let us opt for the universal default 
value and write the default rule: 

[R1] {elem(_, qt, X1)} : ∅ ↦ { (X1) = univ} 

Assuming that on the right of the main verb, in the scope of negation, qualquer 
takes the value indiscriminate, the following rule allows us to choose 
the correct value for qualquer in that context: 

[R2] {elem(_, qt, X1)} : {arrow(_, qt, X1, _, n, X2, L, _), 
arrow(_, n, X2, _, v, X3, L, _), arrow(_, neg, X4, _, v, X3, R, _)} ↦ 
{ (qt) = indiscriminate} 

R2 is more specific than rule R1, thus it is applied in these particular 
conditions. In order to disambiguate, or at least to limit semantic values, other 
semantic rules would have to be added. 
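
The interplay between the default rule and the more specific rule can be illustrated with a small Python sketch. It is not the authors' implementation: arrows are reduced here to hypothetical 5-tuples (source category, source position, target category, target position, direction), and the example sentence positions are invented for the illustration.

```python
def quantifier_value(qt_pos, arrows):
    """Return the semantic value chosen for the quantifier at qt_pos.

    arrows: set of (src_cat, src_pos, tgt_cat, tgt_pos, direction) tuples,
    a simplified stand-in for the arrow/8 facts used in the rules.
    """
    for (c1, p1, c2, noun_pos, d1) in arrows:
        # the quantifier arrows a noun on its left
        if (c1, p1, c2, d1) != ("qt", qt_pos, "n", "L"):
            continue
        for (c3, p3, c4, verb_pos, d2) in arrows:
            # that noun arrows a verb on its left
            if (c3, p3, c4, d2) != ("n", noun_pos, "v", "L"):
                continue
            # a negation arrows the same verb
            if any(a[0] == "neg" and a[2] == "v" and a[3] == verb_pos
                   and a[4] == "R" for a in arrows):
                return "indiscriminate"   # specific rule (R2)
    return "univ"                          # default rule (R1)


# Hypothetical analysis: neg(1) v(2) art(3) n(4) qt(5)
negated = {("qt", 5, "n", 4, "L"), ("n", 4, "v", 2, "L"), ("neg", 1, "v", 2, "R")}
affirmative = {("qt", 5, "n", 4, "L"), ("n", 4, "v", 2, "L")}

print(quantifier_value(5, negated))      # indiscriminate
print(quantifier_value(5, affirmative))  # univ
```

The default value is returned only when the conditions of the more specific rule fail, which mirrors the "most specific rule wins" policy.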

To conclude this section, notice that a traditional syntax-semantic interface 
operating bottom-up is not able to execute this task. Consider again 
that on the right of a main verb in the scope of negation, qualquer takes the 
indiscriminate semantic value. Typically, in bottom-up parsing (Figure 1) 
we will not be able to discard unnecessary values at point (1): when we finally 
have the whole view of the subtree, the semantic rule will not take into 
consideration the negation inside V. 

S -> NP VP 
VP -> V NP (1) 
V -> neg v | v | ... 
neg -> nao 
v -> e | ... 
NP -> art n qt | art n | ... 
art -> um | ... 
n -> jornalista | ... 
qt -> qualquer | ... 

Fig. 1. Grammar and qualquer example 



6 Conclusions 

We presented AsdeCopas, a multi-task module based on incremental, hierarchically 
organized rules, and we applied it to question interpretation and semantic 
disambiguation. 

At present, we are using AsdeCopas to map Portuguese statements 
into Minimal Recursion Semantics [12]. Quantification is being carefully studied 
in this task, which had a preliminary presentation in [11]. 

Abstract. The treatment of Multi-word Expressions (MWE) is a very difficult 
problem for Natural Language Processing in general and for 
Machine Translation in particular. This is because each word of an 
MWE can have a specific meaning while the expression as a whole can have a totally 
different meaning, both in the source and in the target language of a translation. 
Things are further complicated by the fact that the source expression 
can appear in the source text in a form very different from its form in 
a bilingual MWE dictionary (it can have some inflections) and, above 
all, it can have some extensions (some MWE words can have associated 
new words that do not belong to the MWE). The paper shows how 
problems of this kind can be treated and solved using Generative Dependency 
Grammar with Features. 



1 Introduction 

The translation of Multiword Expressions (MWE) is one of the most difficult 
problems of machine translation and even of human translation. As [15] says, 
MWE are “a pain in the neck of natural language processing (NLP)”. There 
are many types of MWE, classified by different criteria [15] [5]. MWE appear 
under different names and interpretations: collocations, compound words, 
co-occurrences, idioms, fixed syntagmas, lexical solidarities, phraseologisms, 
phraseolexemes, polylexical expressions [5]. Not all these names are fully equivalent, but 
they have something in common: they are concerned with groups 
of words that must always be considered as a whole. The words that compose 
an MWE must have a representation that can indicate that, in an MWE 
instance in a text, they can have different forms (can be inflected) 
or can even be replaced by other words that belong to some lexical classes. 
The MWE representation is specific to each grammar type: HPSG [5], Lexicon 
Grammar [7], Tree Adjoining Grammar [18], etc. 

The most important problems of MWE are the following: 

a) The meaning of the MWE can be very different from the sense of each word 
of the MWE (the so-called compositionality [1] or idiomaticity problem [16]). 
John . . . . 

b) An MWE can appear with its words in different inflected forms (morphological 
variation [16] [17]). John the light. John the 
light. 

c) An MWE can appear in a different form through passivization, topicalization, 
scrambling, etc. (syntactic modification [1] or structural variation [17] [18]). 
He me, and then I him. [13] 

d) An MWE can appear with different extensions attached to some words of the 
expression (modifications [18]). Example: Il . (He is exhausted.) 
Il . (He is illy exhausted.) 

e) An MWE can appear with different inserted words that sometimes are not directly 
linked with the words of the expression. He John . He 
a lot of people 

f) An MWE can appear with different words that have a similar sense (lexical 
variation [17] [18]). John ( , ) [13]. 

Solutions for these problems are not easy to find and perhaps the great 
number of names for MWE is due (partially) to the fact that each sort of 
representation solves only some aspects of the problems. 

In this paper we will show a very general method of MWE representation 
based on Generative Dependency Grammar with Features (GDGF) [2] [3] that 
solves many of these problems and can be used in machine translation (MT). 



2 Representation of MWE in GDGF 

We will show in an informal manner how MWE can be represented in GDGF. 
A more formal definition of GDGF can be found in [2] [3]. Let Es be a 
source expression from a text Ts (phrase) in the language with a lexicon Ls 
and Et a target expression (equivalent to Es) in the translation Tt (of the 
text Ts) in the language with the lexicon Lt. Es and Et can have a different 
number of words, sometimes with different senses (if we take the words 
one by one). In a bilingual phraseological dictionary (or MWE dictionary) that 
contains the correspondences between the expressions belonging to two languages 
Ls and Lt, EDs and EDt are the expressions in a basic form (“lemmatized form”) 
corresponding to Es and Et. Es is different from EDs: it can be inflected or can 
have some insertions of other words, linked or not with the words of Es by 
dependency relations. We will denote by Ex the extension of EDs in Es. Ex will 
contain all the words that have dependency relations with the words from Es. In 
the translation process Es must be searched in the MWE dictionary (though it 
is not identical with EDs) and the extension Ex must also be considered in order 
to create a similar extension for Et. 

The texts Ts (and Es) and the entry EDs (with its equivalent EDt) in the MWE 
dictionary will be represented in GDGF. 

A GDGF is a collection of rules. Each rule has a non-terminal on its left-hand side 
and, on its right-hand side, a syntactic part and a dependency part. The syntactic part 
is a sequence of non-terminals/terminals/pseudo-terminals/procedural-actions 
( ). Each element of the syntactic part has an associated Attribute Value 
Tree (AVT) that describes the syntactic/lexical categories and their values. 
The dependency part is a dependency tree (dependencies between the elements 
of the syntactic part). Terminals are words that belong to the lexicon (Ls or 
Lt in our case). Non-terminals are labels (symbols) for grammar categories (and 
are considered not to belong to the lexicons Ls or Lt). Each non-terminal 
appears at least once on the left side of a rule (with the exception of one 
symbol named the root symbol). A pseudo-terminal is a sort of non-terminal that 
does not appear on the left side of a rule but can be replaced directly by a 
word from the lexicon. A procedural-action (or action) is a routine that must be 
executed in order to identify symbols in the text (for example, a routine 
that identifies a number). 

In the MWE dictionary, an Es (and the corresponding Et) is represented 
by only one rule, so it contains only terminals, pseudo-terminals and 
actions ( ). The left part of the right side of this rule will contain only such 
elements, and the right part of the right side (the dependency tree) will be a final tree. 

Up to this point, the GDGF description conforms to [2] [3]. We will now 
introduce something that is specific to the MWE treatment. 

Each element of EDs will have an associated “type of variability”. There are 
three types of variability: invariable, partial variable and total variable. 

a) Invariable. An element is invariable if it always appears, in any context, in 
the same form (so it will always have the same associated AVT or an AVT that 
subsumes the corresponding AVT from the MWE dictionary). 

b) Partial variable. An element is partial variable if it has the same lemma 
(and lexical class) in all the occurrences of the expression in different contexts, 
but can have different lexical categories (it can be inflected). 

c) Total variable. An element is total variable if it can be any word (accepted by the 
grammar in the corresponding position) in all the occurrences of the expression 
in different contexts. Moreover, in the source text, this element can also be a 
coordinate relation. 
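
The three variability types can be read as three matching policies. The following Python sketch illustrates that reading under strong simplifying assumptions: AVT subsumption is replaced by equality of the surface form, word order is fixed, extensions are ignored, and the English entries are hypothetical examples that do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str
    lemma: str
    lexical_class: str


@dataclass(frozen=True)
class EntryElement:
    word: Word
    variability: str  # "invariable", "partial" or "total"


def element_matches(entry: EntryElement, token: Word) -> bool:
    if entry.variability == "total":
        return True  # any word the grammar accepts in this position
    same_lemma = (entry.word.lemma == token.lemma
                  and entry.word.lexical_class == token.lexical_class)
    if entry.variability == "partial":
        return same_lemma  # inflection may differ
    if entry.variability == "invariable":
        return same_lemma and entry.word.form == token.form
    raise ValueError(f"unknown variability: {entry.variability}")


def expression_matches(entry, tokens) -> bool:
    return (len(entry) == len(tokens)
            and all(element_matches(e, t) for e, t in zip(entry, tokens)))


# Hypothetical dictionary entry: "kick the bucket"
entry = [
    EntryElement(Word("kick", "kick", "verb"), "partial"),
    EntryElement(Word("the", "the", "article"), "invariable"),
    EntryElement(Word("bucket", "bucket", "noun"), "invariable"),
]
text_ok = [Word("kicked", "kick", "verb"), Word("the", "the", "article"),
           Word("bucket", "bucket", "noun")]
text_bad = [Word("kicked", "kick", "verb"), Word("the", "the", "article"),
            Word("buckets", "bucket", "noun")]

print(expression_matches(entry, text_ok))   # True
print(expression_matches(entry, text_bad))  # False
```

A full treatment would compare attribute value trees rather than strings and would allow a total variable position to be filled by a coordinate relation.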

Let us have the Romanian expression ” ” and an English equivalent ” ”. The grammar 
rule for the Romanian expression is the following (we do not give the AVTs in full): 

<EDs> -> (” ” [class = verb] ” ” [class = preposition] ” ” [class = 
pronoun] [pronoun type = indefinite] ” ” [class = preposition] ” ” [class 
= noun] [gender = feminine] ” ” [class = pronoun] [pronoun type = 
demonstrative] [gender = feminine] [gender = feminine] ” ” [class = adjective] 
[gender = feminine], ” ”(@r1@(” ”(@r2@(” ”()))), @r4@(” 
”(@r5@(” ”(@r6@(” ”()), @r7@(” ”()))))))) 

(Observations: In the rule, the terminals are lemmas. The inflected form is 
indicated by the AVT. That is why the word ” ” from the expression appears as ” ”.) 

The relations could have more intuitive names, like “relation between noun 
and attribute”, but we used only generic names like r1, r2, etc. 

The grammar rule for the corresponding English expression is: 

<EDt> -> (” ” [class = verb] ” ” [class = pronoun] [pronoun 
type = indefinite] ” ” [class = adverb], ” ”(@r8@(” ”()), 
@r9@(” ”()))) 

We will define a correspondence between () and (), both being 
of total variable type: 

Correspondence () / () 

The Romanian expression can appear in a context like: ” ”. The corresponding 
final rule will be: 

<Ts> = (... ” ” [class = verb] [mode = indicative] [person = III] [number 
= present] [time = perfect] ” ” [class = preposition] ” ” [class = noun] [noun 
type = proper] ” ” [class = conjunction] ” ” [class = noun] [noun type = 
proper] ” ” [class = preposition] ” ” [class = noun] [number = singular] 
” ” [class = pronoun] [pronoun type = demonstrative] [gender = feminine] 
[number = singular] ” ” [class = adjective] [number = singular] [gender = 
feminine] ..., ... ” ”(@r1@(” ”(@r2@(@r3@(” ”(), ” ”()/)))), 
@r4@(” ”(@r5@(” ”(@r6@(” ”()), @r7@(” ”())))))) ...) 

In order to find the expression ” ” 
in the source text we must establish an equivalence between ” ”() and 
@r3@(” ”(), ” ”()/), which is a conjunctive relation represented as a coordinate 
relation. This is possible if we declare ” ”() as “total variable”. In the 
target expression, the terminal somebody() will be substituted by the coordinate 
relation @r3@(” ”(), ” ”()/). 

3 Transformation Rules 

Between the nodes (of type terminal / pseudo-terminal / action / relation - ) 
of the two expressions EDs and EDt we will indicate some transformation 
rules. There are two types of transformation rules (see figure 1): 

a) A transformation rule of type “correspondence” between two nodes indicates 
that the two elements are relatively equivalent in the two expressions. This means 
that the element from EDt, during the translation process, will take over all the links 
(with their descendants) of the corresponding element from Es (which contains the links 
from EDs but also the links from the extension Ex of EDs in Ts). We will adopt 
the following convention: if a correspondence exists between two elements from the 
two expressions EDs and EDt, then they are of the same invariable, partial 
variable or total variable type. A source or a target element can appear in at most 
one transformation of type correspondence. 

b) A transformation rule of type “transfer” between an element 1 from EDs and an 
element 2 from EDt indicates that the two elements are not equivalent in the two 
expressions but, if element 1 from EDs has some links that go to the extension Ex 
of the expression Es (that corresponds to EDs), then all these links with their 
descendants will be taken over by element 2 from EDt (in order to obtain Et). A 
transfer cannot be defined between two relations. 

A source element that appears in correspondences or transfers is said to have 
nonexclusive relationships, i.e. it can have in Ts some links that go to the 
extension Ex of Es. 

Fig. 1. Correspondences and transfers treatment 

A source element that does not appear in correspondences or transfers is said to have 
exclusive relationships, i.e. it cannot have links in Ts that go to the extension 
Ex. These names are explained by the following: 

- If a source element is nonexclusive, then it can have some descendants in the 
extension of the expression, and these descendants must be attached (by 
correspondence or transfer) to a target element from the target expression; so, in 
this case, it must have a correspondence or a transfer. 

- If a source element is exclusive, then it cannot have descendants in the extension 
of the expression and it is not useful to put it in a correspondence or a transfer. 

It is not always possible to transfer the elements from an extension of the source 
expression to other nodes of the target expression. An important feature that 
we can associate with the element that is the target of a correspondence or transfer is 
the combination. Let us have a coordinate relation rx = @relation name@ 
(a conjunction relation “and”, for example) and an element st from the target expression 
subordinated to another element tt by the relation rt. Consider now an element ss 
from the extension of the source expression, subordinated by the relation rs (such 
that eq(rs) = rt, i.e. the equivalent of rs from the source language is rt in the target 
language) to another element ts from the source expression. We suppose that 
a “correspondence” or a “transfer” was defined between ts and tt. We suppose also that 
the equivalent of ts from the source expression is eq(ts) in the target expression. 
In this case the translation of the source expression together with its extension 
using the MWE dictionary will be made as follows: rx will be subordinated to 
tt, eq(ss) will be subordinated by eq(rs), st will be subordinated by rt, and eq(ss) 
and st will be coordinated by rx. We will give some examples later to make 
things clearer. 

We make also the following important observation: the description in the 
MWE must be done so that the heads of the source and target expressions are 
always of the same type (both are relations or both are -s) . 

4 Transformation Cases 

Besides the above notation, we will also use the following (see 
figure 1): 

Ns = A source node from Es (it is of type ). 

Ss = The set of nodes (of type or coordinate relations ) linked to Ns 
by subordinate relations belonging to the extension Ex of Es. 

ns = The number of nodes in Ss. 

Nt = A target node from Et (it is of type ). 

St = The set of nodes linked with Nt by subordinate relations and formed by 
the nodes that come from the MWE dictionary and by the nodes obtained by 
previous transfer or correspondence operations. 

nt = The current number of nodes in St. 

NDs = The source node from the MWE dictionary corresponding to Ns. 

NDt = The target node associated by correspondence or transfer with NDs. 

rx = A relation defined for a combination. 

type(s) = The type of s (where s is a node), i.e. the tuple formed by the 
lexical class of s and the relation by which it is subordinated to another node. 

= Subordinate relation. 

= Coordinate relation. 

We will have type(ss) = type(st) if: 

- eq(ss) has the same lexical class as st; 

- if ss is subordinated by rs and st is subordinated by rt, then eq(rs) = rt. 

We will have type(ss) ≠ type(st) if: 

- eq(ss) has the same lexical class as st; 

- if ss is subordinated by rs and st is subordinated by rt, then eq(rs) ≠ rt. 

We will have ss ~ st if: 

- eq(ss) = st; 

- if ss is subordinated by rs and st is subordinated by rt, then eq(rs) = rt. 

Using this notation we will present some transformation cases. 

Transformations by correspondences or transfers, possibly using the 
combination feature, are used in order to recognize in the source text not only 
the MWE itself but also the MWE with different extensions, and in order to 
translate these extended MWE more appropriately. For example, the expression 
” ” in Romanian, which has the English equivalent ” ”, will also be 
recognized in the form ” ” and translated in the form ” ” if a 
correspondence is indicated between ” ” and ” ”. 

Correspondences, transfers and combinations can be defined only for terminals, 
pseudo-terminals and actions, and not for relations. 

A source element of a correspondence or a transfer can be: 

- with descendents; 

- with descendents; 

- without descendents. 

A subordinate relation can be: 

- with descendents; 

- with descendents. 

A coordinate relation can be: 

- with only descendents; 

- with only descendents; 

- with and descendents. 

By combining these possibilities for the source and for the target element we 
will have a lot of situations. Not all of these situations are useful in practical 
cases. 

When an element A of a source expression is in a correspondence or a transfer 
relation with an element B of an equivalent target expression, the descendants of 
A that go to the extension of the expression in which A appears must be attached 
as descendants of B (through their equivalents). But B can also have some 
descendants of its own. In this case we must establish a method to put together 
the descendants of A and the descendants of B. 

We will consider the next transformation cases: 

) There are many situations: 

The node ss is of type and is linked with the 
node Ns by the subordinate relation rs. For each ss belonging to Ss for which there 
is an st belonging to St such that type(ss) = type(st) but eq(ss) ≠ st, we will do 
the following: 

- Create a new coordinate relation rx with two entries rx(1) and rx(2). 

- Mark that the equivalent of ss is coordinated by rx(1). 

- Mark that st is coordinated by rx(2). 

- Mark that rx is linked with Nt. 

This type of treatment is named 

The equivalents of the node ss and of its possible descendants will be 
transferred to the target language one by one. 

The node ss is of type and is linked with the 
node Ns by the subordinate relation rs. For each ss belonging to Ss for which there 
is no st belonging to St with type(ss) = type(st), we will mark the fact that eq(ss) 
is subordinated by eq(rs) to Nt. 

This type of treatment is named 

The node ss is of type and is linked with the 
node Ns by the subordinate relation rs. For each ss belonging to Ss for which there 
is an st belonging to St with ss ~ st, we will do the following: 

- ss is subsumed by st and it will be ignored. 

- rs is subsumed by rt and it will be ignored. 

This type of treatment is named 

Fig. 2. Example of MWE dictionary entry 

The node is of type coordinate relation and is 
linked with the node by a subordinate relation. For each ss belonging to Ss 
we will mark that eq(ss) is subordinated by eq(rs) to Nt. It is as in case (a2) an 

) In this case the descendants of A 
will be passed to B without conflicts. 

) In this case the problem 
disappears because we have nothing to pass from A to B. B will remain with its 
descendants. 

) In this case the problem disappears. 

These cases cover, if not all situations, a very large number of possible 
situations. 
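
The first three treatments described above (coordinate on a type clash, attach when no node of the same type exists, ignore an identical node) can be summarized in a short Python sketch. It is an illustration under assumptions, not the paper's algorithm: the `Desc` and `Coord` classes, the `eq_rel` and `eq_word` dictionaries standing in for eq(), and the sample words are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Desc:
    relation: str       # subordinate relation linking the node to its head
    word: str
    lexical_class: str


@dataclass
class Coord:
    relation: str       # subordinate relation linking the coordination to Nt
    name: str           # combination relation rx
    members: list       # [equivalent of ss, st]


def merge_descendants(source_ext, target_desc, eq_rel, eq_word, rx):
    """Attach the extension nodes of Ns (source_ext) under Nt (target_desc)."""
    merged = list(target_desc)
    for s in source_ext:
        rel, word = eq_rel[s.relation], eq_word[s.word]
        idx = next((i for i, t in enumerate(merged)
                    if isinstance(t, Desc) and t.relation == rel
                    and t.lexical_class == s.lexical_class), None)
        if idx is None:
            # no target node of the same type: attach eq(ss) by eq(rs)
            merged.append(Desc(rel, word, s.lexical_class))
        elif merged[idx].word == word:
            # ss ~ st: the source node is subsumed and ignored
            continue
        else:
            # same type, different word: coordinate eq(ss) and st with rx
            st = merged[idx]
            merged[idx] = Coord(rel, rx, [Desc(rel, word, s.lexical_class), st])
    return merged


# Hypothetical data: the target head already has an adjective "bon".
target = [Desc("r6", "bon", "adjective"), Desc("r7", "le", "article")]
extension = [Desc("r0", "src_adj", "adjective")]
result = merge_descendants(extension, target,
                           eq_rel={"r0": "r6"},
                           eq_word={"src_adj": "languissant"},
                           rx="et")
print(result[0])
```

In this run the first descendant becomes a coordination whose members are the translated extension adjective and the adjective already present in the target expression, while the article is left untouched.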

Let us have the expression ” ” in Romanian and 
the equivalent expression ” ” in French 
(“to be liked by someone” or literally “to be in the graces of someone”). 

The grammar rule for the Romanian expression is (we do not give 
the AVTs in full; remember that the terminals are lemmas): 

<EDs> -> (” ” [class = verb] ” ” [class = preposition] ” ” [class = 
nom] [gender = feminine] [article = articled] [article type = definite] ” ” 
[class = pronoun] [pronoun type = indefinite] [case = genitive], ” ”(@r1@(” 
”(@r2@(” ”(@r3@(” ”()))))))) 

The grammar rule for the corresponding French expression is: 

<EDt> -> (” ” [class = verb] ” ” [class = preposition] ” ” [class = 
article] [article type = definite] [gender = feminine] [number = plural] ” ” 
[class = adjective] [gender = feminine] [number = plural] ” ” [class = nom] 
[gender = feminine] ” ” [class = preposition] ” ” [class = pronoun] [pronoun 
type = indefinite], ” ”(@r4@(” ”(@r5@(”grace”(@r7@(” ”()), @r6@(” ”()), 
@r8@(” ”(@r9@(” ”())))))))))) 




Fig. 3. Example - principle 



In the MWE dictionary something like figure 2 will appear. A 
correspondence will be defined in the MWE dictionary between ” ” (to be) and 
” ” and between ” ” (grace) and ” ”. A combination with the 
relation ” ” (and) will be associated with the target ” 

Let us now have an expression with an extension ” ” (literally “to be in the 
languishing graces of someone”). The grammar rule for this expression is the 
following: 

<Exp> -> (” ” [class = verb] ” ” [class = preposition] ” ” [class 
= nom] [gender = feminine] [number = plural] [article = articled] [article type 
= definite] [number = plural] ” ” [class = adjective] [gender = feminine] 
[number = plural] ”cineva” [class = pronoun] [pronoun type = indefinite] [case 
= genitive], ” ”(@r1@(” ”(@r2@(” ”(@ro@(” ”()), @r3@(” ”()))))))) 

The general transformation case (of type (a1)) can be represented as in 
figure 3. By applying this transformation case we obtain the final translated 
rule (see figure 2): 

<Exp'> -> (” ” [class = verb] ”dans” [class = preposition] ” ” [class = 
article] [article type = definite] [gender = feminine] ” ” [class = adjective] [gender 
= feminine] [number = plural] ” ” [class = conjunction] ” ” [class = 
adjective] [gender = feminine] ” ” [class = nom] [gender = feminine] ” ” 
[class = preposition] ” ” [class = pronoun] [pronoun type = indefinite], ” ”(@r4@( 
” ”(@r5@(”grace”(@r7@(” ”()), @r6@(@rx@(”bon”(), ” ”())), 
@r8@(” ”(@r9@(” ”())))))))))) 




The reconstruction of the surface text ” ” will be done using a GDGF grammar 
(which is a reversible grammar [2]) of the target language and a mechanism to 
force the agreement between a subordinate and the terminal to which it is 
subordinated [4]. This mechanism states the following: if the node B is 
subordinated by a relation to the node A, then the attribute values of B will be 
forced to agree with the attribute values of A, as described in the agreement 
sections of the grammar rules. 
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5 Conclusions 

We have presented how GDGF solves some problems of MWE treatment. We added to
the GDGF formalism the notions of correspondence, transfer, combination and
variability of MWE components (invariable, partially variable, totally
variable), which allow MWE description and MWE dictionary construction. Through
partial and total variability, the method is useful not only for MWEs: it can
also be used to create a kind of dictionary of grammatical structures, so that
specific structures of the source language can be put in correspondence with
structures of the target language. The ideas presented in the paper were
introduced in a language, GRAALAN (Grammar Abstract Language), that contains
many other features, all of them based on GDGF. This language is designed to
describe pairs of languages in order to make machine translation between them
possible. GRAALAN is now being used to describe the Romanian language and its
correspondences with French and English.
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Abstract. In this paper we investigate the way of combining different 
taggers to improve their performance in the named entity recognition 
task. The main resources used in our experiments are the publicly avail- 
able taggers TnT and TBL and a corpus of Spanish texts in which named 
entities occurrences are tagged with BIO tags. We have defined three 
transformations that provide us three additional versions of the training 
corpus. The transformations change either the words or the tags, and 
the three of them improve the results of TnT and TBL when they are 
trained with the original version of the corpus. With the four versions of 
the corpus and the two taggers, we have eight different models that can 
be combined with several techniques. The experiments carried out show 
that combining them with machine learning techniques improves performance
considerably. We improve on the baselines for TnT ($F_{\beta=1}$ value of
85.25) and TBL ($F_{\beta=1}$ value of 87.45), reaching a value of 90.90 in the
best of our experiments.



1 Introduction 

Named Entity Extraction (NEE) is a subtask of Information Extraction (IE).
It involves 1) the identification of words, or sequences of words, that make up
the name of an entity and 2) the classification of this name into a set of
categories. These categories are predefined and they make up what we call the
domain taxonomy. For example, if the domain taxonomy contains the categories PER
(people), ORG (organizations), LOC (places) and MISC (rest of entities), in the
following text we find an example of each one of them:

El presidente del COI, Juan Antonio Samaranch, se sumo hoy a las
alabanzas vertidas por otros dirigentes deportivos en Rio de Janeiro
sobre la capacidad de esta ciudad para acoger a medio plazo unos
Juegos Olimpicos.

The words “Juan Antonio Samaranch” make up the name of a person, the
word “COI” is an organization name, “Rio de Janeiro” is the name of a place and,
finally, “Juegos Olimpicos” is an event name classified into the category MISC.
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If we want to implement a system that extracts named entities from plain text 
we would meet with two different problems, the recognition of named entities 
and their classification: 

— Named Entity Recognition (NER) is the identification of the word sequence
that makes up the name of an entity.

— Named Entity Classification (NEC) is the subtask in charge of deciding which
category is assigned to a previously recognized entity.

There are systems that perform both subtasks at once. Other systems, however,
make use of two independent subsystems to carry out each subtask sequentially.
The second architecture allows us to choose the most suitable technique for
each subtask. Named entity recognition is a typical grouping (or chunking) task,
while choosing the category of an entity is a classification problem. Therefore,
chunking tools can be used to perform the first task, and classification tools
the second one. In practice, it has been shown [4] that the division into two
separate subtasks is a very good option.

Our approach to the NEE problem is based on the separate architecture. We 
have focused the work presented in this paper on improving the performance 
of the first subtask of the architecture, the NER module. We have applied two 
main techniques: 

— Corpus transformation: this allows us to train the taggers with different
views of the original corpus, taking more advantage of the information
contained in it.

— System combination: this takes into account the tags proposed by several
systems to decide whether a given word is part of a named entity.

The organization of the rest of the paper is as follows. The second section 
presents the resources, measures and baselines used in our experiments. In sec- 
tion three we describe the transformations that we have applied to obtain three 
additional versions of the training corpus. Section four describes the different 
methods that we have used to combine the tags proposed by each model. In 
section five we draw the final conclusions and point out some future work. 



2 Resources and Baselines 

In this section we describe the main resources used: CoNLL-02 corpus, TnT and 
TBL. We also define the baselines for our experiments with the results of TnT 
and TBL trained with the original corpus. 

2.1 Corpus CoNLL-02 

This corpus provides a wide set of named entity examples in Spanish. It was used 
in the Named Entity Recognition shared task of CoNLL-02 [13]. The distribution 
is composed of three different files: 
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— Training corpus with 264715 tokens and 18794 entities. 

— Test-A corpus with 52923 tokens and 4315 entities. We have used this corpus 
as additional training material to estimate the parameters of some of the 
systems developed. 

— Test-B corpus with 51533 tokens and 3558 entities. We have used it only to 
obtain the final experimental results. 

The BIO notation is used to denote the limits of a named entity. The initial
word of a named entity is tagged with a B tag, and the remaining words of the
entity are tagged with I tags. Words outside an entity are denoted with an O tag.
There are four categories: PER (people), LOC (places), ORG (organizations) and
MISC (rest of entities), so the complete set of tags is {B-LOC, I-LOC, B-PER,
I-PER, B-ORG, I-ORG, B-MISC, I-MISC, O}. We do not need the information
about the category for recognition purposes, so we have simplified the tag set
by removing the category information from the tags. Figure 1 shows a fragment
of the original corpus, and its simplified version.
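The simplification of the tag set can be illustrated with a minimal Python sketch. The function name `simplify_tag` and the in-memory list of (word, tag) pairs are assumptions made for illustration; the source does not describe how the conversion was implemented.

```python
def simplify_tag(tag: str) -> str:
    """Drop the category suffix from a BIO tag: 'B-ORG' -> 'B', 'O' -> 'O'."""
    return tag.split("-", 1)[0]


# Fragment of the corpus shown in Figure 1, as (word, tag) pairs.
original = [
    ("El", "O"), ("presidente", "O"), ("del", "O"), ("COI", "B-ORG"),
    (",", "O"), ("Juan", "B-PER"), ("Antonio", "I-PER"), ("Samaranch", "I-PER"),
]

# Version of the corpus tagged only for the recognition subtask.
simplified = [(word, simplify_tag(tag)) for word, tag in original]
```

After this step only the three tags B, I and O remain, which is all the recognition subtask needs.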



Word         Tag        Word         Tag
El           O          El           O
presidente   O          presidente   O
del          O          del          O
COI          B-ORG      COI          B
,            O          ,            O
Juan         B-PER      Juan         B
Antonio      I-PER      Antonio      I
Samaranch    I-PER      Samaranch    I


Fig. 1. Original corpus and corpus tagged only for the recognition subtask 



2.2 Taggers 

In order to have views as different as possible of the NER task we have chosen 
two taggers based upon radically different concepts, TnT and TBL. Both are 
publicly available and re-trainable. 

TBL [3] is a transformation-based learning technique that makes use of the
knowledge provided by tagging errors. The basic idea of TBL consists of
obtaining a set of rules that can transform an imperfect tagging into one with
fewer errors. To achieve this goal, TBL implements an iterative process that
starts with a naive tagging. This tagging is improved at each iteration by
learning rules that transform it into another one closer to the correct tagging.
TBL has been successfully used in several Natural Language Processing (NLP)
tasks such as shallow parsing, POS tagging, text chunking and prepositional
phrase attachment.
TnT [1] is one of the most widely used re-trainable taggers in NLP tasks. It is
based upon second order Markov Models, consisting of word emission probabil- 
ities and tag transition probabilities computed from trigrams of tags. As a first 
step it computes the probabilities from a tagged corpus through maximum like- 
lihood estimation, then it implements a linear interpolation smoothing method 
to manage the sparse data problem. It also incorporates a suffix analysis for 
dealing with unknown words, assigning tag probabilities according to the word 
ending. 

2.3 Measures 

The measures used in our experiments are precision, recall and the overall
performance measure $F_{\beta=1}$. These measures were originally used for
Information Retrieval (IR) evaluation purposes, but they have been adapted to
many NLP tasks. Precision is the proportion of the entities recognized by the
system that are correct, and recall is the proportion of the actual entities
that the system has been able to recognize:

$$\text{Precision} = \frac{\text{correct entities}}{\text{all recognized entities}} \qquad \text{Recall} = \frac{\text{correct entities}}{\text{actual entities}}$$

Finally, $F_{\beta=1}$ combines recall and precision in a single measure, giving
both the same weight:

$$F_{\beta=1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

We will rely on the $F_{\beta=1}$ measure for analyzing the results of our
experiments. It is a good performance indicator of a system and it is usually
used as comparison criterion.
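As an illustration of these measures, the following Python sketch scores a predicted B/I/O tagging against a reference tagging at the entity level. It assumes that an entity counts as correct only when both of its boundaries match, and that an I tag not preceded by B or I is ignored; the function names are hypothetical and this is not the evaluation script used in the experiments.

```python
def extract_entities(tags):
    """Return the set of (start, end) spans encoded by a B/I/O tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if start is not None and tag != "I":
            spans.add((start, i))   # the open entity ends before position i
            start = None
        if tag == "B":
            start = i               # a new entity starts here
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F with beta = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(gold_tags, predicted_tags):
    """Entity-level precision, recall and F for one tagged sequence."""
    gold = extract_entities(gold_tags)
    predicted = extract_entities(predicted_tags)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)
```

Applying `f_measure` to the precision and recall values reported for the baselines (84.39 and 86.12 for TnT, 85.34 and 89.66 for TBL) gives back the reported values 85.25 and 87.45 after rounding.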

2.4 Baselines 

Table 1 shows the NER results obtained when TBL and TnT are trained with
the CoNLL-02 corpus. We will adopt these results as baselines for the rest of
the experiments in this paper. TBL presents better results than TnT in the three
measures, and this will be a constant in the rest of the experiments. In
contrast, TBL is slower than TnT: while TnT trains in a few seconds, TBL needs
several minutes to process the entire corpus.

Table 1. Baselines. NER results for TnT and TBL trained with the original version of
the CoNLL-02 corpus

        Precision   Recall    $F_{\beta=1}$
TnT     84.39%      86.12%    85.25
TBL     85.34%      89.66%    87.45
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3 Corpus Transformations 

It seems logical to think that if we have more information before taking a decision 
we have more possibilities of choosing the best option. For this reason we have 
decided to increase the number of models. 

There are two obvious ways of building new models: using new training cor- 
pora or training other taggers with the same corpus. We have tried a different 
approach, defining three transformations that applied to the original corpus give 
us three additional versions of it. With four different views of the same informa- 
tion, the taggers learn in a different way and the resulting models can specialize 
in the recognition of named entities of different nature. Transformations can be 
defined to simplify the original corpus or to add new information to it. If we 
simplify the corpus we reduce the number of possible examples and the sparse 
data problem will be smoothed. On the other hand if we enrich the corpus the 
model can use the added information to identify new examples not recognized 
in the original model. In the following subsections we describe the
transformations explored in our experiments. Figure 2 shows the results of
applying these transformations to the example of Figure 1.



a) Vocabulary reduction.

Word              Tag
El                O
presidente        O
del               O
_all_cap_         B
,                 O
_starts_cap_      B
_starts_cap_      I
_starts_cap_      I

b) Change of tag set.

Word              Tag
El                O
presidente        O
del               O
COI               BE
,                 O
Juan              B
Antonio           I
Samaranch         E

c) Addition of POS information.

Word              Tag
El_det_           O
presidente_noun_  O
del_prep_         O
_all_cap__noun_   B
,_punt_           O
_cap__noun_       B
_cap__noun_       I
_cap__noun_       I

Fig. 2. Result of applying transformations to the corpus fragment shown in Figure 1



3.1 Vocabulary Reduction 

This transformation discards most of the information given by words in the cor- 
pus, emphasizing the most useful features for the recognition of named entities. 
We employ a technique similar to that used in [11] replacing the words in the 
corpus with tokens that contain relevant information for recognition. 

One of the problems that we try to solve with this transformation is the
treatment of unknown words. These are the words that do not appear in the
training corpus, and therefore the tagger cannot make any assumption about
them. Handling unknown words is a typical problem in almost all corpus-based
applications; in the case of named entity recognition it is even more important,
because unknown words are good candidates to be part of an entity. The lack
of information about an unknown word can be mitigated with its typographic
information, because in Spanish (as in many other languages) capitalization is
used when writing named entities.

Apart from typographic information there are other features that can be 
useful in the identification of entities, for example non-capitalized words that 
frequently appear before, after or inside named entities. We call them trigger 
words and they are of great help in the identification of entity boundaries. 

Both pieces of information, trigger words and typographic clues, are extracted 
from the original corpus through the application of the following rules: 

— Each word is replaced by a representative token, for example, _starts_cap_
for words that start with a capital letter, _all_cap_ if the whole word is
upper case, a corresponding token for words written in lower case, etc. These
word patterns are identified using a small set of regular expressions.

— Not all words are replaced with their corresponding token: the trigger words
remain as they appear in the original corpus. The list of trigger words is
computed automatically by counting the words that most frequently appear
around or inside an entity.
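The two rules above can be sketched in Python as follows. The token names `_starts_cap_` and `_all_cap_` follow Figure 2; the name `_lower_`, the regular expressions, the small trigger list and the decision to leave punctuation untouched are all assumptions made for illustration, since the source does not list them.

```python
import re

# Hypothetical trigger words; the paper computes this list automatically
# from the words that most frequently appear around or inside entities.
TRIGGER_WORDS = {"el", "presidente", "del"}

# Ordered (pattern, token) pairs; the first matching pattern wins.
PATTERNS = [
    (re.compile(r"^[A-ZÁÉÍÓÚÑÜ]{2,}$"), "_all_cap_"),
    (re.compile(r"^[A-ZÁÉÍÓÚÑÜ]"), "_starts_cap_"),
    (re.compile(r"^[a-záéíóúñü]+$"), "_lower_"),   # hypothetical token name
]


def reduce_word(word, triggers=TRIGGER_WORDS):
    """Replace a word by its typographic token unless it is a trigger word."""
    if word.lower() in triggers:
        return word
    for pattern, token in PATTERNS:
        if pattern.search(word):
            return token
    return word   # punctuation, digits, etc. are kept as they are (assumption)


words = ["El", "presidente", "del", "COI", ",", "Juan", "Antonio", "Samaranch"]
reduced = [reduce_word(w) for w in words]
```

With this trigger list, the fragment of Figure 1 is mapped to the word column of Figure 2 a).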

Vocabulary reduction leads to an improvement in the performance of TnT
and TBL. The results of the experiments (TnT-V and TBL-V) are presented in
Table 2. TnT improves from 85.25 to 86.63 and TBL improves from 87.45 to
88.10.

3.2 Change of Tag Set 

This transformation does not affect the words but the tags. The basic idea is
to replace the original BIO notation with a more expressive one that includes
information about the words that usually end a named entity. The new tag set
has five tags: the three original ones (although two of them change their
meaning slightly) plus two new tags:

— B, that denotes the beginning of a named entity with more than one word.

— BE, that is assigned to a single-word named entity.

— I, that is assigned to words that are inside a multiple-word named entity,
except the last word.

— E, that is assigned to the last word of a multiple-word named entity.

— O, that preserves its original meaning: words outside a named entity.

This new tag set gives more relevance to the position of a word, forcing the
taggers to learn which words appear more frequently at the beginning, at the
end or inside a named entity.
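The mapping between the two notations follows directly from the definitions of the five tags. The sketch below is one possible implementation; the function names are hypothetical, and the reverse mapping is included on the assumption that the output of a model trained with the new tag set has to be converted back to B/I/O before it is compared or combined with the other models.

```python
def to_extended_tags(tags):
    """Convert a B/I/O sequence into the five-tag set B, BE, I, E, O."""
    extended = []
    for i, tag in enumerate(tags):
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        ends_here = next_tag != "I"      # the entity does not continue
        if tag == "B":
            extended.append("BE" if ends_here else "B")
        elif tag == "I":
            extended.append("E" if ends_here else "I")
        else:
            extended.append("O")
    return extended


def to_bio(tags):
    """Map the five-tag notation back to plain B/I/O."""
    mapping = {"BE": "B", "E": "I"}
    return [mapping.get(tag, tag) for tag in tags]


bio = ["O", "O", "O", "B", "O", "B", "I", "I"]
extended = to_extended_tags(bio)
```

For the fragment of Figure 1 this produces the tag column of Figure 2 b).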

Changing the tag set also leads to better results than those obtained with
the original corpus. The results of the experiments (TnT-N and TBL-N) are
shown in Table 2. TBL improves from 87.45 to 87.91 and TnT improves from
85.25 to 86.83, the best result achieved with TnT (with an error reduction of
over 10%).
3.3 Addition of Part-of-Speech Information

Unlike the previous corpus transformations, in this case we make use of
external knowledge to add new information to the original corpus. Each word
is replaced with a compound tag that integrates two pieces of information:

— The result of applying the first transformation (vocabulary reduction).

— The part of speech (POS) tag of the word.

To obtain the POS tag of a word we have trained TnT with the Spanish
corpus CLiC-TALP [5]. This corpus is a one hundred thousand word collection
of samples of written language; it includes extracts from newspapers, journals,
academic books and novels. It is completely tagged: each word has a lemma and a
tag that indicates its part of speech and additional information such as number,
tense or gender. In our experiments we have only used the part of speech
information.

We make use of a compound tag in the substitution because the POS tag alone
does not provide enough information to recognize an entity: we would miss
the knowledge given by typographical features. For this reason we decided to
combine the POS tag with the tag resulting from the application of the
vocabulary reduction transformation. The size of the new vocabulary is greater
than that obtained with the first transformation, but it is still smaller than
the size of the original vocabulary. So, besides incorporating the POS tag
information, we still take advantage of the reduction of vocabulary in dealing
with the unknown word and sparse data problems.

Adding part of speech information also improves the performance of TBL and
TnT. Table 2 presents the results of the experiments TnT-P and TBL-P. TnT
improves from 85.25 to 86.69 and TBL improves from 87.45 to 89.22, the best
result achieved with TBL (an error reduction of over 14%).



Table 2. Results of transformation experiments

         Precision   Recall    $F_{\beta=1}$
TnT      84.39%      86.12%    85.25
TnT-V    85.19%      88.11%    86.63
TnT-N    86.21%      87.47%    86.83
TnT-P    85.33%      88.09%    86.69
TBL      85.34%      89.66%    87.45
TBL-V    87.72%      88.48%    88.10
TBL-N    86.78%      89.07%    87.91
TBL-P    89.14%      89.29%    89.22



4 System Combination 

The three transformations studied improve the performance of the NER task.
This shows that the two techniques employed, adding information and removing
information, can produce good versions of the original corpus through
different views of the same text.

But we still have room for improvement if, instead of applying the
transformations separately, we make them work together. We can take advantage of
discrepancies among models to choose the most suitable tag for a given word.
In the following sections we present the experiments carried out by combining
the results of the eight models using different combination schemes. All of
these schemes achieve better values of $F_{\beta=1}$ than the best of the
participating models in isolation.

System combination is not a new approach in NLP tasks. It has been used
in several problems such as part of speech tagging [7], word sense
disambiguation [9], parsing [8], noun phrase identification [12] and even named
entity extraction [6]. The most popular techniques are voting and stacking
(machine learning methods), and the different views of the problem are usually
obtained using several taggers or several training corpora. In this paper,
however, we are interested in investigating how these methods behave when the
combined systems are obtained with transformed versions of the same training
corpus.

4.1 Voting 

The most obvious way of combining different opinions about the same task is
voting. Despite its simplicity, voting gives very good results, better even
than some of the more sophisticated methods that we will present in later
subsections. We have carried out two experiments based on this combination
scheme:

— Voting: one model, one vote. The opinion of each model participating in the
combination is equally important.

— Voting-W: giving more importance to the opinion of better models. The vote
of a model is weighted according to its performance in a previous evaluation.

Table 3 shows the results of these experiments. Both achieved better values
of $F_{\beta=1}$ than the best of the participating models (TBL-P with 89.22):
Voting reached 89.97 and Voting-W reached 90.02.
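Both voting schemes can be sketched with a single Python function. This is an illustrative sketch only: the function name `vote`, the tie-breaking rule and the use of each model's $F_{\beta=1}$ value from Table 2 as its weight are assumptions, since the source only states that votes are weighted according to performance in a previous evaluation. The sketch also assumes that the tags of all models have already been mapped to plain B/I/O.

```python
from collections import defaultdict


def vote(tags, weights=None):
    """Return the tag with the largest (optionally weighted) number of votes.

    tags    -- one proposed tag per model, for a single word
    weights -- one weight per model; equal weights when omitted
    """
    if weights is None:
        weights = [1.0] * len(tags)
    scores = defaultdict(float)
    for tag, weight in zip(tags, weights):
        scores[tag] += weight
    # Ties are broken by tag name, an arbitrary choice made for this sketch.
    return max(scores, key=lambda tag: (scores[tag], tag))


# Tags proposed for one word by the eight models, in the order
# TnT, TnT-V, TnT-N, TnT-P, TBL, TBL-V, TBL-N, TBL-P.
proposed = ["I", "I", "I", "B", "B", "B", "I", "I"]

# Hypothetical weights: the F values of each model reported in Table 2.
f_weights = [85.25, 86.63, 86.83, 86.69, 87.45, 88.10, 87.91, 89.22]

plain = vote(proposed)
weighted = vote(proposed, f_weights)
```

In this example both schemes return I, because five of the eight models propose it and their combined weight is also the largest.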

4.2 Stacking 

Stacking consists in applying machine learning techniques for combining the 
results of different models. The main idea is to build a system that learns the 
way in which each model is right or makes a mistake. In this way the final 
decision is taken according to a pattern of correct and wrong answers. 

In order to be able to learn the way in which every model is right or wrong,
we need a set of examples, known as a training database in machine learning
terminology. Each example in the training database includes the eight tags
proposed by the models for a given word (we call them features) and the actual
tag (we call it the class). From this point of view, deciding the tag given the
tags proposed by several models is a typical classification problem.
Figure 3 shows a small database written in “arff” format, the notation
employed by Weka [14] to represent training databases. Weka is a collection of
machine learning algorithms for data mining tasks, and is the tool that we have
used in our stacking experiments.



@relation elaboration

@attribute TnT           {O, B, I}
@attribute TnT-VOC-RED   {O, B, I}
@attribute TnT-NEW-TAGS  {O, B, I}
@attribute TnT-POS       {O, B, I}
@attribute TBL           {O, B, I}
@attribute TBL-VOC-RED   {O, B, I}
@attribute TBL-NEW-TAGS  {O, B, I}
@attribute TBL-POS       {O, B, I}
@attribute ACTUAL-TAG    {O, B, I}

@data
I, I, I, B, B, B, I, I, I
O, O, O, O, O, O, O, O, O
B, B, B, B, B, B, B, B, B
I, I, I, I, I, I, I, I, I
O, I, I, I, I, I, O, O, I
B, I, I, I, I, I, B, B, I
O, O, O, O, O, O, O, O, O
O, O, O, O, O, O, O, O, O
B, B, B, O, O, B, B, B, O

Fig. 3. A training database. Each record corresponds to a word



Most of the examples in Figure 3 can be resolved with a voting scheme, because
the majority opinion agrees with the actual tag. However, the last example
presents a different situation: six of the eight models assign the tag B to the
word in question, while only two (TnT-POS and TBL) assign the correct tag
O. If this example is not an isolated one, it would be interesting to learn it
and assign the tag O to those words that present this answer pattern.
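The idea of learning answer patterns can be illustrated with a deliberately simple stand-in for a real learner. The sketch below memorizes, for each pattern of eight proposed tags seen in the training database, the actual tag most often associated with it, and falls back to majority voting for unseen patterns. This lookup table is an assumption made to keep the example self-contained; it is not the decision tree used in the stacking experiments, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_pattern_classifier(examples):
    """Map each answer pattern to the actual tag most often seen with it."""
    counts = defaultdict(Counter)
    for features, actual in examples:
        counts[tuple(features)][actual] += 1
    return {pattern: c.most_common(1)[0][0] for pattern, c in counts.items()}


def classify(model, features):
    """Use the learned pattern if known, otherwise fall back to majority vote."""
    pattern = tuple(features)
    if pattern in model:
        return model[pattern]
    return Counter(pattern).most_common(1)[0][0]


# The records of Figure 3: eight proposed tags followed by the actual tag.
records = ["IIIBBBIII", "OOOOOOOOO", "BBBBBBBBB", "IIIIIIIII", "OIIIIIOOI",
           "BIIIIIBBI", "OOOOOOOOO", "OOOOOOOOO", "BBBOOBBBO"]
examples = [(list(r[:8]), r[8]) for r in records]

model = train_pattern_classifier(examples)
learned = classify(model, list("BBBOOBBB"))   # pattern of the last record
unseen = classify(model, list("BBBBBBBO"))    # pattern not in the database
```

Here the pattern of the last record is resolved to O even though six models propose B, while the unseen pattern is resolved by majority to B. A decision tree generalizes over such patterns instead of memorizing them.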

The number of examples in the training database has a considerable influence
on the learning process: using more examples usually leads to a better
performance of the classifier. We have used the other evaluation corpus (Test-A)
to generate them; it is independent of the models and it is also independent of
the evaluation process. This corpus, with 52923 tokens, provides enough examples
to learn a good classifier.

Table 3 shows the results of the experiment Tree, carried out using a decision
tree [10] as stacking technique. The $F_{\beta=1}$ measure is 89.72, better than
the best of the participating models (TBL-P with 89.22) but worse than the value
obtained by weighted voting (90.02). This does not mean that stacking is a worse
technique than voting; we will see in the following experiments that stacking
achieves better results than voting. The fact is that the generalization process
carried out to induce the tree does not cover all the examples, but there is
still a feature of stacking that can compensate for this phenomenon: the
possibility of merging heterogeneous information.

4.3 Adding Contextual Information 

As we have mentioned before, one of the advantages of machine learning
techniques is that they can make use of information of a different nature.
While in the voting scheme we can only take into account the tags proposed by
the eight models, in a training database we can include as many features as we
consider important for taking the correct decision.

One of the most valuable pieces of information is the knowledge of the tags
assigned to the words around the word in question. To add this information we
only have to extend the number of features of each example in the database. Now,
besides the tags assigned by each model to a word, we include in the database
the tags assigned by the models to the surrounding words. We have carried out
two experiments varying the number of words included in the context:

— Tree-1: We only include the tags of the previous and the following words. So
each example has twenty-four features.

— Tree-2: The tags of the two previous words and the two following words are
included. So each example now has forty features.

In both experiments a decision tree is the technique employed to learn the
classifier. Table 3 shows the results of the experiments Tree-1 and Tree-2. Both
improve on the results of voting and of stacking without contextual information:
Tree-1 obtained a value of 90.23 in the $F_{\beta=1}$ measure, and Tree-2
obtained 90.48.

Larger context sizes considerably increase the number of features in
the database and do not lead to better results.
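The construction of the extended examples can be sketched as follows. The function name and the choice of padding positions outside the sentence with O tags are assumptions; the source does not say how sentence boundaries were handled.

```python
def context_features(model_tags, position, window, pad="O"):
    """Build the feature list for one word, including its context.

    model_tags -- for every word in the sentence, the list of tags proposed
                  by the models (eight tags per word in the experiments)
    position   -- index of the word being classified
    window     -- number of words taken on each side
    """
    n_models = len(model_tags[position])
    features = []
    for offset in range(-window, window + 1):
        i = position + offset
        if 0 <= i < len(model_tags):
            features.extend(model_tags[i])
        else:
            features.extend([pad] * n_models)   # outside the sentence
    return features


# Three consecutive words with the tags proposed by eight models.
sentence = [["O"] * 8, ["B"] * 8, ["I"] * 8]

tree_1_example = context_features(sentence, 1, window=1)
tree_2_example = context_features(sentence, 1, window=2)
```

With eight models, a window of one word on each side yields the twenty-four features of Tree-1 and a window of two words yields the forty features of Tree-2.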

4.4 Bagging 

Apart from allowing the use of heterogeneous information, machine learning has
another important advantage over voting: it is possible to choose among a great
variety of schemes and techniques to find the most suitable one for each
problem.

Bagging [2] is one of these schemes. It provides a good way of handling the
possible bias of the model towards some of the examples of the training
database.

Bagging is based on the generation of several training data sets from a single
base data set. Each new version is obtained by sampling the original database
with replacement. Each new data set can be used to train a model, and the
answers of all the models can be combined to obtain a joint answer. Generally,
bagging leads to better results than those obtained with a single classifier.
The price to pay is that this kind of combination method increases the
computational cost associated with learning.
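The procedure just described can be sketched in a few lines of Python. To keep the example self-contained, the base learner is a simple pattern lookup with a majority-vote fallback rather than a decision tree; that substitution, the function names, the number of models and the fixed random seed are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def train_lookup(examples):
    """Base learner for this sketch: most frequent class per feature pattern."""
    counts = defaultdict(Counter)
    for features, actual in examples:
        counts[tuple(features)][actual] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}


def predict_lookup(model, features):
    """Known pattern -> learned class; unknown pattern -> majority of features."""
    pattern = tuple(features)
    return model.get(pattern, Counter(pattern).most_common(1)[0][0])


def bagging_train(examples, n_models=10, seed=0):
    """Train one model per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    return [train_lookup([rng.choice(examples) for _ in examples])
            for _ in range(n_models)]


def bagging_predict(models, features):
    """Combine the answers of all the models by majority vote."""
    votes = Counter(predict_lookup(m, features) for m in models)
    return votes.most_common(1)[0][0]


examples = [(("B", "B"), "B")] * 5 + [(("O", "O"), "O")] * 5
models = bagging_train(examples)
answer = bagging_predict(models, ("B", "B"))
```

Each bootstrap sample has the same size as the original database, so the cost of training grows linearly with the number of models, which is the computational price mentioned above.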

Table 3 shows the results of the experiment Bagging. In this experiment we
apply this scheme using a decision tree as base learner. With this method we
obtain the best result (90.90), with an error reduction of over 38% and 27% with
respect to the baselines given, respectively, by the TnT and TBL experiments.
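The reported error reductions are consistent with treating $100 - F_{\beta=1}$ as the error of a system. Under that reading, which is an assumption because the text does not define the term explicitly, the computation for the two baselines is:

```latex
\[
\text{error reduction} = \frac{F_{\text{new}} - F_{\text{base}}}{100 - F_{\text{base}}},
\qquad
\frac{90.90 - 85.25}{100 - 85.25} = \frac{5.65}{14.75} \approx 38.3\%,
\qquad
\frac{90.90 - 87.45}{100 - 87.45} = \frac{3.45}{12.55} \approx 27.5\%.
\]
```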







Table 3. Results of combination experiments

           Precision   Recall    $F_{\beta=1}$
Voting     89.67%      90.28%    89.97
Voting-W   89.43%      90.62%    90.02
Tree       88.93%      90.53%    89.72
Tree-1     89.54%      90.92%    90.23
Tree-2     90.18%      90.78%    90.48
Bagging    90.69%      91.12%    90.90



5 Conclusions and Future Work 

In this paper we have shown that the combination of several taggers is an
effective technique for named entity recognition. Taking as baselines the
results obtained when TnT and TBL are trained with a corpus annotated with named
entity tags, we have investigated alternative methods for taking more advantage
of the knowledge provided by the corpus. We have experimented with three corpus
transformations, adding information to the corpus or removing it. All three
improve on the results obtained with the original version of the corpus.

The four versions of the corpus and the two different taggers allow us to build
eight models that can be combined using several techniques. All the proposed
combination techniques improve on the results of the best of the participating
models in isolation. We have experimented with voting, stacking using a decision
tree as learning technique, and stacking using bagging as learning scheme. Our
best experiment achieved an $F_{\beta=1}$ measure of 90.90, which means an error
reduction of 38.30% and 27.49% in relation to the baselines given by TnT and
TBL. This performance is similar to that of state-of-the-art NER systems, with
results comparable to those obtained by the best system in the CoNLL-02
competition [4], which achieved an $F_{\beta=1}$ value of 91.66 in the
recognition task.

We have developed our systems for recognizing named entities in Spanish
texts because we are especially interested in this language, but it would be
easy to reproduce the experiments in other languages given the corresponding
corpus.

Much future work remains. We are interested in applying the ideas of this
paper to the recognition of entities in specific domains. In this kind of task
the knowledge about the domain could be incorporated into the system via new
transformations. We also plan to take advantage of system combination to help in
the construction of annotated corpora, using the jointly assigned tag as
agreement criterion in co-training or active learning schemes.
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Abstract. Word fragments or n-grams have been widely used to perform 
different Natural Language Processing tasks such as information retrieval [1] 
[2], document categorization [3], automatic summarization [4] or, even, genetic 
classification of languages [5]. All these techniques share some common 
aspects such as: (1) documents are mapped to a vector space where n-grams are 
used as coordinates and their relative frequencies as vector weights, (2) many of 
them compute a context which plays a role similar to stop-word lists, and (3) 
cosine distance is commonly used for document-to-document and 
query-to-document comparisons. blindLight is a new approach related to these
classical n-gram techniques, although it introduces two major differences: (1)
relative frequencies are no longer used as vector weights but are replaced by
n-gram significances, and (2) cosine distance is abandoned in favor of a new
metric inspired by sequence alignment techniques, although not so computationally
expensive. This new approach can be simultaneously used to perform document
categorization and clustering, information retrieval, and text summarization. In 
this paper we will describe the foundations of such a technique and its 
application to both a particular categorization problem (i.e., language 
identification) and information retrieval tasks. 



1 Introduction 

N-grams are simply text sequences consisting of n items, not necessarily contiguous,
which can be either words or characters. Frequently, the term n-gram refers to slices
of n adjoining characters, including blanks and running over different words. These
character n-grams are well suited to support a vector space model to map documents.
In such a model each document can be considered a D-dimensional vector of weights,
where D is the number of unique n-grams in the document set and the i-th weight in
the vector is the relative frequency, within the document to be mapped, of the i-th
n-gram. Thus, given two documents (or a query and a document), a simple similarity
measure can be computed as the cosine of the angle between both vectors. This
measure is especially interesting because it is not affected by length differences
between the compared documents. This approach, exemplified in classical works such
as [6] or [7], provides several advantages: it is language independent, it is quite
robust in the face of typographical or grammatical errors, and it does not require
word stemming or stop-word removal.

Nevertheless, the n-gram vector model has more applications besides information 
retrieval (i.e., comparing a query with a document). By using the same cosine distance 
as a similarity measure, the model can be applied to document clustering and 
categorization [3] [8]. In addition, document vectors from similar documents (a 
cluster or a human-made document set) can be used to obtain a centroid vector [8]. 
Within this centroid, each i-th weight is just the average of the i-th weights from all 
vectors in the set. Such a centroid provides a “context” in which to perform document 
comparisons, given that it must be subtracted from the document vectors involved in 
the process. An especially interesting application of n-grams where a context had to 
be provided was Highlights [4], which used vectors to model both documents and 
these documents’ “background”^1. Highlights extracted keywords automatically from a 
document with regard to its particular background. 
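
The centroid and its subtraction can be sketched as follows. This is only an 
illustration, under the assumption that vectors are stored as dictionaries mapping 
n-grams to weights and that an n-gram missing from a vector has weight 0; the helper 
names centroid and subtract are hypothetical. 

```python
def centroid(vectors):
    """Average weight of every n-gram over all vectors (absent n-grams count as 0)."""
    grams = set().union(*vectors)
    return {g: sum(v.get(g, 0.0) for v in vectors) / len(vectors) for g in grams}


def subtract(vector, context):
    """Subtract a context vector (e.g. a centroid) from a document vector."""
    grams = set(vector) | set(context)
    return {g: vector.get(g, 0.0) - context.get(g, 0.0) for g in grams}
```

After the subtraction, a weight is positive only where the document uses an n-gram 
more than the document set does on average. 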

Such classical approaches show two major drawbacks: (1) since documents are 
represented by D-dimensional vectors of weights, where D is the total number of 
different n-grams in the whole document set, such vectors are not document 
representations by themselves but representations relative to a bigger “contextual” 
corpus, and (2) cosine similarities between high-dimensional vectors tend to be 0 
(i.e., two random documents have a high probability of being orthogonal to each 
other), so, to avoid this “curse of dimensionality” problem, it is necessary to reduce 
the number of features (i.e., n-grams), which is usually done by setting arbitrary 
weight thresholds. 

blindLight is a new approach related to those described above and thus applicable to 
the tasks mentioned (i.e., document categorization and clustering, information 
retrieval, keyword extraction or automatic summarization [9]). However, it takes into 
account two important requirements in order to avoid the problems of previous 
solutions: first, every document must be assigned a unique document vector 
irrespective of any corpus and, second, a measure other than cosine similarity has to 
be used. 



2 Foundations of the blindLight Approach 

blindLight, like other n-gram vector space solutions, maps every document to a vector 
of weights; however, such document vectors are rather different from classical ones. 
On the one hand, any two document vectors obtained through this technique are not 
necessarily of equal dimensions; thus, there is no actual “vector space” in this 
proposal. On the other hand, the weights used in these vectors are not relative 
frequencies but the significance of each n-gram within the document. 

Computing a measure of the relation between elements inside n-grams, and thus 
the importance of the whole n-gram, is a problem with a long history of research; 
we will focus on just a few references. In 1993 Dunning described a method 
based on likelihood ratio tests to detect keywords and domain-specific terms [10]. 
However, his technique worked only for word bigrams, and it was Ferreira da Silva and 
Pereira Lopes [11] who presented a generalization of different statistical 
measures so that these could be applied to word n-grams of arbitrary length. In 
addition, they also introduced a new measure, Symmetrical Conditional Probability 
(SCP) [12], which outperforms other statistics-based measures. According to Pereira 
Lopes, their approach obtains better results than those achieved by Dunning. 

blindLight implements the technique described by da Silva and Lopes, although 
applied to character n-grams rather than word n-grams. Thus, it measures the relation 
among the characters inside each n-gram and hence the significance of every n-gram 
or, equivalently, the weight of each component in a document vector. 
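
To make the idea of a significance weight more concrete, the fragment below sketches 
the SCP measure with the “fair dispersion” normalisation as we understand it from 
[11] and [12]: the squared probability of the whole n-gram divided by the average, 
over every split point, of the product of the probabilities of its left and right 
parts. How probabilities are estimated inside a single document is not detailed in 
this section, so the estimator used here (occurrences divided by text length) is an 
assumption, as are the function names. The values it yields lie between 0 and 1 and 
are therefore not on the same scale as the weights shown later in Fig. 2. 

```python
def count_occurrences(text, seq):
    """Number of (possibly overlapping) occurrences of seq in text."""
    k = len(seq)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == seq)


def probability(text, seq):
    """Assumed estimator: occurrences of seq divided by the length of the text."""
    return count_occurrences(text, seq) / len(text) if text else 0.0


def scp_f(text, gram):
    """SCP with fair dispersion normalisation for one character n-gram (n >= 2)."""
    n = len(gram)
    if n < 2:
        raise ValueError("SCP_f needs an n-gram of at least two characters")
    p_whole = probability(text, gram)
    # Average over the n-1 split points of p(left part) * p(right part).
    avp = sum(probability(text, gram[:i]) * probability(text, gram[i:])
              for i in range(1, n)) / (n - 1)
    return p_whole ** 2 / avp if avp > 0 else 0.0
```

A document vector would then pair every distinct n-gram of the document with the 
value returned for it. 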



^1 This background was not a centroid but was built using the dataset as just one long 
document. 

With regard to comparisons between vectors, a simple similarity measure such as 
the cosine distance cannot be straightforwardly applied when using vectors of different 
dimensions. Of course, one could consider a temporary vector space of dimension 
$d_1 + d_2$, where $d_1$ and $d_2$ are the respective dimensions of the document 
vectors to be compared, assigning a null weight to one vector’s n-grams not present in 
the other and vice versa. However, we consider the absence of a particular n-gram 
within a document rather distinct from its presence with null significance. 

Ultimately, comparing two vectors with different dimensions can be seen as a 
pairwise alignment problem. There are two sequences with different lengths and some 
(or no) elements in common that must be aligned; that is, the highest number of 
columns of identical pairs must be obtained by only inserting gaps, changing or 
deleting elements in both sequences. 

One of the simplest models of distance for pairwise alignment is the so-called 
Levenshtein or edit distance [13] which can be defined as the smallest number of 
insertions, deletions, and substitutions required to change one string into another (e.g. 
the distance between "accommodate" and "aconmodate" is 2). 
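
For reference, the edit distance mentioned above can be computed with the usual 
dynamic programming scheme. The sketch below is a generic textbook formulation, not 
code from [13]; the function name is ours. 

```python
def levenshtein(a, b):
    """Smallest number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


print(levenshtein("accommodate", "aconmodate"))  # 2
```

The two edits in this example are the deletion of one "c" and the substitution of 
the first "m" by "n". 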

However, there are two noticeable differences between pairwise-aligning text 
strings and comparing vectors of different lengths, even though the former can be 
seen as vectors of characters. The first difference is rather important: the order of 
the components is central in pairwise alignment (e.g., DNA analysis or spell checking) 
while it plays no role within a vector-space model. The second one is also highly 
significant: leaving the order of the components aside, “weights” in pairwise 
alignment are integer values while in vector-space models they are real. 

Thus, distance functions for pairwise alignment, although inspiring, cannot be 
applied to the problem at hand. Instead, a new distance measure is needed and, in 
fact, two are provided. Classical vector-space based approaches assume that the 
distance, and so the similarity, between two document vectors is commutative (e.g., 
cosine distance). blindLight, however, proposes two similarity measures when 
comparing document vectors. For the sake of clarity, we will call the two 
documents query (Q) and target (T), although these similarity functions can be equally 
applied to any pair of documents, not only for information retrieval purposes. 

Let Q and T be two blindLight document vectors with dimensions m and n: 

$$Q = \{(k_{1Q}, w_{1Q}), (k_{2Q}, w_{2Q}), \ldots, (k_{mQ}, w_{mQ})\} \tag{1}$$

$$T = \{(k_{1T}, w_{1T}), (k_{2T}, w_{2T}), \ldots, (k_{nT}, w_{nT})\} \tag{2}$$

$k_{ij}$ is the i-th n-gram in document j while $w_{ij}$ is the significance (computed 
using SCP [12]) of the n-gram $k_{ij}$ within the same document j. 

We define the total significance for document vectors Q and T, $S_Q$ and $S_T$ 
respectively, as: 

$$S_Q = \sum_{i=1}^{m} w_{iQ} \tag{3}$$

$$S_T = \sum_{i=1}^{n} w_{iT} \tag{4}$$

Then, the pseudo-alignment operator, Ω, is defined as follows: 

$$Q \Omega T = \left\{ (k_x, w_x) \;\middle|\; (k_x = k_{iQ} = k_{jT}) \wedge (w_x = \min(w_{iQ}, w_{jT})),\; (k_{iQ}, w_{iQ}) \in Q,\; 1 \le i \le m,\; (k_{jT}, w_{jT}) \in T,\; 1 \le j \le n \right\} \tag{5}$$

Similarly to equations 3 and 4 we can define the total significance for QΩT: 

$$S_{Q \Omega T} = \sum_{(k_x, w_x) \in Q \Omega T} w_x \tag{6}$$

Finally, we can define two similarity measures, one to compare Q vs. T, Π 
(uppercase Pi), and a second one to compare T vs. Q, Ρ (uppercase Rho), which can 
be seen as analogous to precision and recall measures: 

$$\Pi = S_{Q \Omega T} / S_Q \tag{7}$$

$$\mathrm{P} = S_{Q \Omega T} / S_T \tag{8}$$
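
Equations 5 to 8 translate almost literally into code. The following sketch assumes 
that a document vector is held as a dictionary from n-grams to significances; the 
names omega, total_significance and pi_and_rho are ours, and the weights in the toy 
vectors are made up for illustration rather than computed with SCP. 

```python
def omega(q, t):
    """Pseudo-alignment: n-grams present in both vectors, each with its smaller weight."""
    return {k: min(w, t[k]) for k, w in q.items() if k in t}


def total_significance(vector):
    """Sum of all the weights of a vector (equations 3, 4 and 6)."""
    return sum(vector.values())


def pi_and_rho(q, t):
    """Return (Pi, Rho) for a query vector q and a target vector t (equations 7 and 8)."""
    shared = total_significance(omega(q, t))
    s_q, s_t = total_significance(q), total_significance(t)
    pi = shared / s_q if s_q else 0.0
    rho = shared / s_t if s_t else 0.0
    return pi, rho


# Toy vectors with invented weights.
query = {"saur": 2.0, "dino": 1.0, "desp": 1.0}
target = {"saur": 1.0, "dino": 3.0, "noss": 4.0}
print(omega(query, target))       # {'saur': 1.0, 'dino': 1.0}
print(pi_and_rho(query, target))  # (0.5, 0.25)
```

Note that swapping the arguments swaps the two values, which is what makes the pair 
of measures non-commutative. 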

To clarify these concepts we will show a simple example based on (one of) the 
shortest stories ever written. We will compare the original version of Monterroso’s 
Dinosaur with a Portuguese translation; the first one will play the query role and the 
second one the target, and the n-grams will be quad-grams. 



Cuando despertó, el dinosaurio todavía estaba allí. (Query) 

Quando acordou, o dinossauro ainda estava lá. (Target) 

Fig. 1. “El dinosaurio” by Augusto Monterroso, Spanish original and Portuguese translation 

In summary, the blindLight technique, although vector-based, does not need a 
predefined document set in which to perform NLP tasks, so such tasks can be 
carried out over ever-growing document sets or, just the opposite, over one single 
document [9]. Relative frequencies are abandoned as vector weights in favor of a 
measure of the importance of each n-gram. In addition, the similarity measures are 
analogous to those used in pairwise alignment, although computationally inexpensive 
and also non-commutative, which allows us to combine both measures, Π and Ρ, into 
any linear combination in order to tune it to each NLP task. 

The rest of the paper will describe some test bed experiments to evaluate our 
prototypes at different tasks, namely, language identification, genetic classification of 
languages, and document retrieval. 



Q vector (45 elements)     T vector (39 elements)     QΩT (10 elements) 

Cuan    2.489              va_l    2.545              saur    2.244 
l_di    2.392              rdou    2.323              inos    2.177 
stab    2.392              stav    2.323              uand    2.119 
...                        ...                        _est    2.091 
saur    2.313              saur    2.244              dino    2.022 
desp    2.313              noss    2.177              _din    2.022 
...                        ...                        esta    2.012 
ndo_    2.137              a_lá    2.022              ndo_    1.981 
nosa    2.137              [illegible] 2.022          a_es    1.943 
...                        ...                        ando    1.876 
ando    2.012              auro    1.908 
avía    1.945              ando    1.876 
_all    1.915              do_a    1.767 

Π: 0.209    Ρ: 0.253 

Fig. 2. blindLight document vectors for both documents in Fig. 1 (truncated to show ten 
elements; blanks have been replaced by underscores). The QΩT intersection vector is 
shown, plus the Π and Ρ values indicating the similarities between both documents 



3 Language Identification and Genetic Classification of 
Languages Using blindLight 

Natural language identification from digital text has a long tradition and many 
techniques have been proposed: for instance, looking within the text for particular 
characters [14], words [15], and, of course, n-grams [16] or [17]. Techniques based on 
n-gram vectors using the cosine distance perform quite well in this task. Such 
techniques usually follow these steps: (1) for each document in the corpus an n-gram 
vector is created, (2) while creating document vectors a centroid vector is also 
computed, and (3) when an unknown text sample is presented to the system it is (3-a) 
mapped into the vector space, (3-b) the centroid is subtracted from the sample, and 
(3-c) the sample is compared, by means of the cosine distance, with all the reference 
documents in the set (which have also had the centroid subtracted). Finally, the 
reference document found most similar to the sample is used to inform the user of the 
language in which the sample is probably written. 

The application of blindLight to the construction of a language identifier involves 
some subtle differences from previous approaches. On the one hand, it is not necessary 
to subtract any centroid either from the reference documents or from the text sample. 
On the other hand, our language identifier does not need to compare the sample vector 
with every reference language in the database because a language tree is prepared in 
advance in order to keep the number of comparisons to a minimum. 

The language identifier to be described is able to distinguish the following 
European languages: Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, 
French, German, Italian, Norwegian, Portuguese, Spanish and Swedish. To ensure that 
the identification is made solely on the basis of the language and is not biased by the 
contents of the reference documents, the whole database consists of literal translations 
of the same document: the first three chapters of the Book of Genesis. 

Building the first version of the language identifier was quite simple. First, an 
n-gram vector was obtained for every translation of Genesis. Afterwards, a 
similarity measure based on Π and Ρ was defined, which eventually was just Π (the 
submitted sample being the query). Finally, the identifier only needs to receive a 
sample of text, to create an n-gram vector for that sample and to compute the Π 
similarity between the sample and each reference document. The higher the value of Π, 
the more likely it is that the language of the reference is the one used in the sample. 
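
This first version amounts to an arg-max over the reference vectors. A minimal 
sketch follows, assuming that the sample and the references have already been mapped 
to dictionaries of n-gram significances; the function names and the tiny reference 
vectors used in the test are hypothetical. 

```python
def pi(sample, reference):
    """Pi similarity: shared significance divided by the sample's total significance."""
    shared = sum(min(w, reference[k]) for k, w in sample.items() if k in reference)
    total = sum(sample.values())
    return shared / total if total else 0.0


def identify(sample, references):
    """Return the language whose reference vector yields the highest Pi."""
    return max(references, key=lambda language: pi(sample, references[language]))
```

Here references maps a language name to the vector obtained from its translation of 
the reference text. 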

The second version of the language identifier was inspired by the appealing idea of 
performing genetic classification of languages (i.e., determining how different human 
languages relate to each other) by automatic means. Of course, this idea has already 
been explored (e.g., [5], [18], or [19]). However, many of these approaches employ 
either classical vector-space techniques or the Levenshtein distance, appropriately 
applied to character or phoneme sequences, so we found the application of blindLight 
to this problem rather challenging. 

The genetic classification of languages using blindLight was performed over two 
different kinds of linguistic data. The first experiment was conducted on the vectors 
obtained from the text of the Book of Genesis and so produced a tree with 14 languages 
(Fig. 3). The second experiment involved vectors computed from phonetic 
transcriptions of the fable “The North Wind and the Sun”, which were mainly obtained 
from the Handbook of the International Phonetic Association [20]. The languages that 
took part in this second experiment were: Catalan, Dutch, English, French, Galician, 
German, Portuguese, Spanish, and Swedish. This task produced a tree with 9 
languages (Fig. 4), of which 8 were also present in the results from the first 
experiment. Both experiments used the expression 0.5Π + 0.5Ρ as similarity measure, 
thus establishing a commutative similarity measure when comparing languages. A 
technique similar to Jarvis-Patrick clustering [21] was used to build the dendrograms 
(Figures 3 and 4); however, describing this technique is beyond the scope of this paper. 
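
The effect of the equal weighting can be checked with a few lines of code. The 
sketch below assumes dictionary vectors as before; combined_similarity and its alpha 
parameter are illustrative names for a linear combination of the two measures, with 
alpha = 0.5 giving the expression used in both experiments. 

```python
def pi_and_rho(q, t):
    """Return (Pi, Rho): shared significance over the totals of q and of t."""
    shared = sum(min(w, t[k]) for k, w in q.items() if k in t)
    s_q, s_t = sum(q.values()), sum(t.values())
    return (shared / s_q if s_q else 0.0, shared / s_t if s_t else 0.0)


def combined_similarity(q, t, alpha=0.5):
    """alpha * Pi + (1 - alpha) * Rho; alpha = 0.5 makes the measure symmetric."""
    pi, rho = pi_and_rho(q, t)
    return alpha * pi + (1 - alpha) * rho
```

With alpha = 0.5 the value is the same whichever language plays the query role, 
because exchanging the arguments merely exchanges Π and Ρ. 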

We are not linguists, so we will not attempt to conclude anything from these 
experiments. Nevertheless, not only are both trees coherent with each other, but most 
of the relations shown in them are also consistent, to the best of our knowledge, with 
linguistic theories. Even the close relation shown between Catalan and French, both 
lexically and phonetically, finds support in some authors (e.g., Pere Verdaguer [22]), 
although it contrasts with those classifications which consider Catalan an 
Ibero-Romance language rather distant from the Oïl family (to which French belongs). 

The data obtained from the lexical comparison of languages were used to prepare a 
set of artificial mixed vectors^2, namely, Catalan-French, Danish-Swedish, 
Dutch-German, Portuguese-Spanish, Italic, northGermanic, and 
westGermanic. Such vectors are simply the Ω-intersection of the different reference 
vectors belonging to each category (e.g., westGermanic involves the Dutch, English, 
and German vectors). 

To determine the language in which a sample is written two steps are followed. First, 
the sample is discriminated against Basque, Finnish, Italic, northGermanic and 
westGermanic. Then, depending on the broad category obtained, a second phase of 
comparisons may be needed. Once these two phases have been completed the system 
informs the user about both the language and the family to which it belongs. 
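
The two phases can be sketched as follows. The code is a simplification under 
several assumptions: the language tree is flattened into just two levels (broad 
category, then language), vectors are dictionaries of n-gram significances, and the 
names mixed_vector, identify and groups are ours. In a real system the mixed vectors 
would be computed once, in advance, rather than on every call. 

```python
from functools import reduce


def omega(q, t):
    """Shared n-grams of two vectors, each with the smaller of its two weights."""
    return {k: min(w, t[k]) for k, w in q.items() if k in t}


def pi(sample, reference):
    """Shared significance divided by the sample's total significance."""
    total = sum(sample.values())
    return sum(omega(sample, reference).values()) / total if total else 0.0


def mixed_vector(vectors):
    """Omega-intersection of several reference vectors of one category."""
    return reduce(omega, vectors)


def identify(sample, groups):
    """groups maps a broad category to {language: reference vector}.

    Phase 1 picks the category through its mixed vector; phase 2 picks the
    language inside that category. Returns (language, category).
    """
    category_vectors = {name: mixed_vector(list(members.values()))
                        for name, members in groups.items()}
    category = max(category_vectors,
                   key=lambda name: pi(sample, category_vectors[name]))
    members = groups[category]
    language = max(members, key=lambda name: pi(sample, members[name]))
    return language, category
```

A category with a single member, such as Basque or Finnish, simply has that 
language's own vector as its mixed vector, so the second phase is trivial for it. 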

Although the language identifier needs thorough testing, an extremely simple 
experiment was performed to get some feedback about its accuracy. 1,500 posts were 
downloaded from five soc.culture.* newsgroups, namely, basque, catalan, 
french, galiza, and german. 



^2 It can be argued that the language classification experiments described were not 
needed, at least not to build the language identifier, given that actual language 
classifications are well known. Nevertheless, such experiments were, in fact, 
essential because it could not be assumed that artificial language vectors built using 
blindLight would work as expected by only taking into account data provided by 
non-computational classifications. 

Fig. 3. Unrooted dendrogram showing distances between 14 European written 
language samples (first three chapters of the Book of Genesis) 

Fig. 4. Unrooted dendrogram showing distances between 9 European oral language 
samples (phonetic transcriptions of the fable “The North Wind and the Sun”). Distance 
between Gallo-Iberian (on the left) and Germanic subtrees is 23.985, more than twice 
the distance shown in the picture 



We included posts from soc.culture.galiza to test the system with unknown 
languages. It must be said that very few posts in that group are actually written in 
Galician. Of those which were actually Galician language samples, 63.48% were 
classified as Portuguese and 36.52% as Spanish, which seems quite reasonable. 

Each raw post, without stripping any header information^3, was submitted to the 
language identifier. Then, if the language assigned by the prototype did not match the 
supposed language for that post (according to its newsgroup of origin), it was reviewed 
by a human to check whether it was a fault of the system (e.g., assigning English to a 
post written in any other language) or an actual negative (e.g., a German document 
posted to soc.culture.french). Each fault was added to the count of positives to obtain 
the total number of documents written in the target language within its newsgroup 
and, thus, to compute the language identifier accuracy for that language. In the end, 
this was a daunting task because many of these newsgroups suffer from heavy spam 
and cross-posting problems. The results obtained with this experiment are shown in 
Table 1. 



4 Information Retrieval Using blindLight 

An information retrieval prototype built upon this approach is participating in the 
CLEF^4 2004 campaign at the time of writing this paper. As with any other 
application of blindLight, a similarity measure to compare queries and documents is 
needed. So far just two have been tested: Π and a more complex one (see 
equation 9), which provides rather satisfactory results. 



^3 The header was not stripped in order to check the system’s tolerance to “noise” (i.e., 
the presence of much English-like text). It was found that documents with an actual 
language sample of around 200 characters could be correctly classified in spite of being 
attached to quite lengthy headers (from 500 to more than 900 characters). 

^4 Cross Language Evaluation Forum (http://www.clef-campaign.org). 

Table 1. Partial results achieved by the language identifier. Accuracy is the fraction of total 
documents from the newsgroup written in the target language that were correctly identified 

Newsgroup              Languages found in the sample posts            Target language   Accuracy 

soc.culture.basque     Spanish 96.87%, Basque 2.19%, English 0.94%    Basque            100% 
soc.culture.catalan    Catalan 51.63%, Spanish 48.37%                 Catalan           98.44% 
soc.culture.french     English 73.85%, French 25.23%, German 0.92%    French            97.56% 
soc.culture.german     German 50.35%, English 48.94%, French 0.71%    German            97.18% 



$$\frac{\Pi + \mathrm{norm}(\Pi \cdot \mathrm{P})}{2} \tag{9}$$

The goal of the norm function shown in the previous equation is just to translate the 
range of Π·Ρ values into the range of Π values, thus making possible a comprehensive 
combination of both (otherwise, Ρ, and thus Π·Ρ values, are negligible when 
compared to Π). 

The operation of the blindLight IR system is really simple: 

- For each document in the database an n-gram vector is obtained and stored, just in 
the same way it can be computed to obtain a summary, a list of keyphrases or to 
determine the language in which it is written. 

- When a query is submitted to the system, the system computes an n-gram vector for it 
and compares it with every document, obtaining Π and Ρ values. 

- From these values a ranking measure is worked out, and a reverse ordered list of 
documents is returned as a response to the query. 
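
The three steps above can be sketched as a ranking function. The text only states 
that norm translates the range of Π·Ρ values into the range of Π values, so the 
linear min-max rescaling over the documents being ranked that is used below is an 
assumption, as are the names rescale and rank and the dictionary representation of 
the stored vectors. 

```python
def pi_and_rho(q, t):
    """Return (Pi, Rho): shared significance over the totals of q and of t."""
    shared = sum(min(w, t[k]) for k, w in q.items() if k in t)
    s_q, s_t = sum(q.values()), sum(t.values())
    return (shared / s_q if s_q else 0.0, shared / s_t if s_t else 0.0)


def rescale(values, low, high):
    """Linearly map values onto [low, high] (assumed form of the norm function)."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # All values coincide, so any point of the range would do; use the top.
        return [high] * len(values)
    return [low + (v - smallest) * (high - low) / (largest - smallest)
            for v in values]


def rank(query, documents):
    """Return the document ids ordered by decreasing value of equation 9."""
    ids = list(documents)
    if not ids:
        return []
    pairs = [pi_and_rho(query, documents[d]) for d in ids]
    pis = [p for p, _ in pairs]
    products = rescale([p * r for p, r in pairs], min(pis), max(pis))
    score = {d: (p + n) / 2 for d, p, n in zip(ids, pis, products)}
    return sorted(ids, key=lambda d: score[d], reverse=True)
```

Every stored vector is visited once per query, which is the cost discussed in the 
next paragraph. 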

This way of operating entails both advantages and disadvantages: documents 
may be added to the database at any moment because there is no indexing process; 
however, comparing a query with every document in the database can be rather time 
consuming and is not feasible with very large datasets. In order to reduce the number 
of document-to-query comparisons a clustering phase may be done in advance, in a 
similar way to the language tree used within the language identifier. Of course, by 
doing this, working over ever-growing datasets is no longer possible because the 
system would have to be shut down periodically to perform indexing. Thorough 
performance analysis is needed to determine what database size requires this prior 
clustering. 

There are no results yet about this system’s performance in the CLEF experiments; 
however, it was tested on two very small standard collections with promising results. 
These collections were CACM (3204 documents and 64 queries) and CISI (1460 
documents and 112 queries). Both were originally provided with the SMART system^5 
and have become a widely used benchmark, thus enabling comparisons between 
different IR systems. 

^5 Available at ftp://ftp.cs.cornell.edu/pub/smart 

Figure 5 shows the interpolated precision-recall graphs for both collections and 
both ranking measures (namely, pi and piro). Such results are similar to those reached 
by several systems but not as good as those achieved by others; for instance, 11-pt. 
average precision was 16.73% and 13.41% for CACM and CISI, respectively, while 
the SMART IR system achieves 37.78% and 19.45% for the same collections. 
However, it must be said that these experiments were performed over the documents 
and the queries just as they are; that is, common techniques such as stop-word 
removal, stemming, or weighting of the query terms (all used by SMART) were not 
applied to the document set, and the queries were provided to the system in a literal 
fashion^6, as if they were actually submitted by the original users. By avoiding such 
techniques, the system is totally language independent, at least for non-ideographic 
languages, although performance must be improved. 
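
For readers unfamiliar with the 11-pt. figure quoted above, the fragment below 
sketches the usual definition for a single query: the interpolated precision is read 
at the recall levels 0.0, 0.1, ..., 1.0 and the eleven values are averaged. It is a 
generic formulation with a function name of our choosing, not the evaluation code 
actually used in these experiments; the figure for a collection would be the mean of 
this value over all its queries. 

```python
def eleven_point_average_precision(ranking, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    points = []  # (relevant documents seen so far, precision at that rank)
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits, hits / rank))
    total = 0.0
    for step in range(11):
        # Interpolated precision: best precision at any recall >= step / 10.
        # The recall comparison is done on integers to avoid rounding problems.
        reachable = [p for h, p in points if 10 * h >= step * len(relevant)]
        total += max(reachable, default=0.0)
    return total / 11
```

For instance, a ranking d1, d2, d3, d4 in which only d1 and d3 are relevant has 
precision 1 up to recall 0.5 and 2/3 beyond it, which gives roughly 0.85. 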

In addition, it was really simple to evolve this system towards 
cross-language retrieval (i.e., a query written in one language retrieves documents 
written in another one). This was done without performing machine translation, by 
taking advantage of a sentence-aligned corpus in the source (S) and target (T) 
languages. 

The query written in the source language, $Q_S$, is split into word chunks (from one 
word to the whole query). The S corpus is searched for sentences containing 
any of these chunks. Every sentence found in S is replaced by its counterpart in the T 
corpus. All sentences from T corresponding to each chunk within the original query 
are Ω-intersected. Since such sentences contain, allegedly, the translation of some 
words from language S into language T, it can be supposed that the Ω-intersection of 
their vectors would contain a kind of “translated” n-grams (see Fig. 6). 
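
The procedure can be sketched as follows. Several points are assumptions made only 
for the sake of a self-contained example: the aligned corpus is given as two parallel 
lists of sentences, chunk matching is a plain substring test, and unit_vector is a 
stand-in vectoriser that gives every character quad-gram a weight of 1.0 instead of 
an SCP-based significance. How the per-chunk vectors are finally merged into one 
query vector is not specified here, so the sketch returns them separately. 

```python
from functools import reduce


def omega(q, t):
    """Shared n-grams of two vectors, each with the smaller of its two weights."""
    return {k: min(w, t[k]) for k, w in q.items() if k in t}


def unit_vector(sentence, n=4):
    """Stand-in vectoriser: each character n-gram gets weight 1.0 (blanks -> '_')."""
    text = sentence.replace(" ", "_")
    return {text[i:i + n]: 1.0 for i in range(len(text) - n + 1)}


def word_chunks(query):
    """All contiguous word sequences, from single words to the whole query."""
    words = query.split()
    return [" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)]


def pseudo_translate(query, source, target, vectorise=unit_vector):
    """Map each chunk found in the source corpus to the Omega-intersection of the
    vectors of the aligned target sentences."""
    translated = {}
    for chunk in word_chunks(query):
        vectors = [vectorise(target[i])
                   for i, sentence in enumerate(source) if chunk in sentence]
        if vectors:
            translated[chunk] = reduce(omega, vectors)
    return translated
```

Only n-grams common to every aligned target sentence survive the intersection, 
which is why material specific to one sentence drops out of the result. 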

Thus, a vector is obtained that is similar, in theory, to the one that could be 
computed from a real translation of the original query. To build this pseudo-translator 
within the blindLight IR prototype, the European Parliament Proceedings Parallel Corpus 
1996-2003 [23] has been used, obtaining interesting results: on average, 38.59% 
of the n-grams from pseudo-translated query vectors are present within the vectors 
from actual translated queries and, in turn, 28.31% of the n-grams from the actual 
translated query vectors correspond to n-grams within the pseudo-translated ones. 



5 Conclusions and Future Work 

Gayo et al. [9] introduced a new technique, blindLight, claiming it could be used to 
perform several NLP tasks, such as document clustering and categorization, language 
identification, information retrieval, keyphrase extraction and automatic 
summarization from single documents, showing results for the last two tasks. 

In this paper the vector model used within blindLight has been refined and the 
similarity measures used to perform document comparisons have been formalized. In 
addition, it has been shown that such a technique can actually be applied to 
language identification, genetic classification of languages and cross-language IR. 



^6 Just an example query from the CACM collection: #64 List all articles on EL1 and 
ECL (EL1 may be given as EL/1; I don't remember how they did it. The 
blindLight IR prototype processes queries like this one in an “as is” manner. 

[Figure: interpolated P-R graphs; curves: CACM (pi ranking), CACM (piro ranking), 
CISI (pi ranking), CISI (piro ranking)] 

Fig. 5. Interpolated precision-recall graphs for the blindLight IR system applied to CACM and 
CISI test collections. Top-10 average precision for CACM and CISI was 19.8% and 19.6% 
respectively, in both cases using piro ranking 



Query written in language S (from CLEF 2004 French topic list) 

Trouver des documents évoquant des discussions sur la réforme des institutions financières, en 
particulier la Banque Mondiale et le Fond Monétaire International, lors du sommet du G7 qui a eu 
lieu à Halifax en 1995. 

Some sentences from corpus S (Europarl French) 

(0861) ...la Conférence intergouvernementale sur la réforme des institutions européennes... 
(1104) ...l'état des travaux concernant la réforme des institutions, réforme qui... 
(5116) ...le seul grand défi qui se pose à l'Union est la réforme des institutions de l'UE... 

Counterpart sentences from corpus T (Europarl English) 

(0861) ...The Intergov. Conferenc. to address [...] the reform of the European institutions... 
(1104) ...the state of progress in the reform of the institutions, which is... 
(5116) ...the single greatest challenge facing the Union is the reform of the EU institutions... 

Pseudo-translated query vector (Ω-intersection of previous T sentences) 

(..., _ins, _ref, _the, efor, form, inst, itut, nsti, orm_, refo, stit, the_, tion, titu, tuti, utio, ...) 

Fig. 6. Procedure to pseudo-translate a query written originally in a source language (in this 
case French) into a vector containing appropriate n-grams from the target language (English in 
this example). Blanks have been replaced by underscores; just one chunk from the query has 
been pseudo-translated 

The partial results obtained for language identification show that it is a robust 
technique, with an accuracy higher than 97% at an information-to-noise ratio of 
around 2/7. 

The application of this approach to the automatic genetic classification of languages, 
from both lexical and phonological input, produced data coherent with most linguistic 
theories and, besides, useful to improve the operation of a language identifier 
built using the very same technique. 

The performance achieved when applying blindLight to IR is not as good as that of 
some IR systems but close to that of many others. However, it must be noted that 
common techniques such as stop-word removal or stemming are not used. This has surely 
had an impact on performance but, in exchange, the approach is totally language 
independent. On the other hand, it has been shown how easily cross-language IR can be 
implemented by performing pseudo-translation of queries (i.e., queries are not 
actually translated but parallel corpora are used to obtain a vector containing n-grams 
highly likely to be present in actual translations). 

Therefore, an extremely simple technique relying on the mapping of documents to 
n-gram vectors, in addition to a metric able to compare vectors of different lengths, 
appears to be flexible enough to be applied to a wide range of NLP tasks, showing 
adequate performance in all of them. 



Abstract. We present a novel approach to incorporating semantic 
information to the problems of natural language processing, in particular 
to the document classification task. The approach builds on the 
intuition that semantic relatedness of words can be viewed as a non-static 
property of the words that depends on the particular task at hand. The 
semantic relatedness information is incorporated using feature 
transformations, where the transformations are based on a feature ontology and 
on the particular classification task and data. We demonstrate the 
approach on the problem of classifying MEDLINE-indexed documents 
using the MeSH ontology. The results suggest that the method is capable 
of improving the classification performance on most of the datasets. 



1 Introduction 

Many natural language processing tasks can benefit from information about 
semantic relatedness of words. For example, the methods for information retrieval 
and text classification tasks can be extended to capture information about words 
that are lexically distinct but semantically related. This is in contrast with the 
common bag-of-words representation of text where no semantic relatedness 
information is captured. Information on semantic relatedness of words can be 
beneficial in at least two practical ways. Combining the related cases that would 
be distinct in the standard bag-of-words representation may result in a better 
predictor, for example, by yielding more accurate maximum-likelihood estimates 
in probabilistic methods such as the naive Bayes classifier. Further, words that 
are very rare or even unseen during training, but are closely semantically related 
to some more frequent word, can be used as a source of information. 

Semantic networks such as WordNet^1 and UMLS^2 are obvious sources of 
semantic knowledge about words. The semantic networks are usually represented 
as graphs with nodes representing words and edges representing semantic 
relationships such as synonymy, hypernymy, and meronymy. 

One way to incorporate the information on semantic relatedness of words is 
to define a quantitative measure that can be used in various classification and 
clustering techniques. The need for such a quantitative measure has given rise 
to various techniques that measure pairwise word semantic relatedness based on 
semantic networks. 

^1 http://www.cogsci.princeton.edu/~wn/ 
^2 http://www.nlm.nih.gov/research/umls/ 

The approach of Rada and Bicknell ([1]) defines the strength of the 
relationship between two words in terms of the minimum number of edges connecting 
the words in the semantic network graph. Resnik ([2]) argues that the 
semantic distance covered by single edges varies and employs a corpus-based method 
for estimating the distance of related concepts. Budanitsky ([3]) presents an 
application-oriented evaluation of these two and three other methods. It should 
be noted that these methods aim to measure the strength of the pairwise word 
relationship as a static property of the words, that is, the strength of the 
relationship is defined independently of the task at hand. 
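
As an illustration of the edge-counting idea, the minimum number of edges between 
two words can be found with a breadth-first search. The sketch below assumes that 
the semantic network is given as an undirected adjacency dictionary; the function 
name and the tiny network used in the test are invented for the example and are not 
taken from [1] or from any real resource. 

```python
from collections import deque


def edge_distance(graph, start, goal):
    """Minimum number of edges between two nodes, or None if they are not connected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None
```

The value returned depends only on the network, which is precisely the sense in 
which such a measure is static. 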

In this paper, we devise and investigate techniques that are based on the 
intuition that an optimal measure of semantic relatedness is not a static 
property of words, but depends also on the problem at hand. To illustrate the 
intuition, let us consider the task of text classification and the two words “mouse” 
and “human”. For many text classification tasks, it would be beneficial to 
consider “mouse” and “human” to be relatively distant, but in case of the 
hypothetical classification task where the goal is to distinguish between documents 
about eucaryotes and procaryotes, it might be beneficial to consider “mouse” 
and “human” similar or even synonymous. Conversely, the two words “wheat” 
and “oat” would typically be considered closely related, but, for example, in 
the Reuters-21578 classification dataset,^3 where the two words define distinct 
classes, it would be beneficial to consider the words unrelated. Relating 
features in a task-specific manner has also been considered by Baker and 
McCallum ([4]), who introduce a feature clustering method with a primary focus on 
dimensionality reduction. However, their method is not governed by semantic 
networks, but it is based on the distribution of class labels associated with each 
feature. 

Instead of defining a quantitative measure of the strength of semantic rela- 
tionship between words, we incorporate the semantic information in the form 
of transformations based on the hierarchical ontology that underlies the words. 
The relations encoded in the hierarchy are the starting point of the proposed 
method. The method then operates on a given training set for a given problem, 
and it attempts to identify elementary transformations of the features that are 
beneficial to the performance of a machine learning method for the problem. 
Roughly, each transformation decides on the relatedness or unrelatedness of a 
set of words. In the mouse vs. human example above, the method would be ex- 
pected to relate the words “mouse” and “human” only if such a step improves 
the performance of the machine learning method on the task. 

We apply the method to a classification of MEDLINE-indexed² documents,
where each document is annotated with a set of terms from the MeSH
ontology³. However, the method is applicable to any problem where the features
are organized in a hierarchy and a measure of performance can be defined.

¹ http://www.daviddlewis.com/resources/testcollections/reuters21578/
² http://www.nlm.nih.gov/

The paper is organized as follows. In Section 2 we define the necessary con- 
cepts and describe the method. Section 3 describes an application of the method 
to biomedical literature mining based on the MeSH ontology. The empirical re- 
sults and possible future directions are discussed in Section 4, and Section 5 
concludes the paper. 

2 Feature Transformations 

In this section we define a feature hierarchy in the form of a tree and present 
some of the possible elementary feature mappings based on the hierarchy. 

2.1 Feature Hierarchy 

Let $F$ be a finite set of possible features that are organized into an “is a” concept
hierarchy in the form of a tree. Let $a, b \in F$ be features. If $a$ is a child of $b$,
denoted as $a \prec b$, we say that $a$ is a direct specialization of $b$ and $b$ is a
direct generalization of $a$. If $a$ is a descendant of $b$, denoted as $a \prec^* b$, we
say that $a$ is a specialization of $b$ and $b$ is a generalization of $a$.

Let further $G(b) = \{a \mid b \prec^* a\}$ be the set of all generalizations of $b$. Similarly,
let $S(a) = \{b \mid b \prec^* a\}$ be the set of all specializations of $a$. We say that $a$ is
most general if it is the root of the hierarchy, that is, $G(a) = \emptyset$. Similarly, we say
that $b$ is most specific if it is a leaf of the hierarchy, that is, $S(b) = \emptyset$.
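The definitions of the sets G and S can be illustrated with a minimal sketch. The toy hierarchy and all function names below are hypothetical and serve only as an illustration; the tree is stored as a child-to-parent map.

```python
from typing import Dict, Optional, Set

# Hypothetical toy hierarchy: each feature maps to its parent (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "Organisms": None,
    "Animals": "Organisms",
    "Bacteria": "Organisms",
    "Mouse": "Animals",
    "Human": "Animals",
    "E. coli": "Bacteria",
}


def generalizations(b: str) -> Set[str]:
    """G(b): every proper ancestor of b in the tree."""
    result: Set[str] = set()
    node = PARENT[b]
    while node is not None:
        result.add(node)
        node = PARENT[node]
    return result


def specializations(a: str) -> Set[str]:
    """S(a): every proper descendant of a in the tree."""
    return {f for f in PARENT if a in generalizations(f)}


def is_root(a: str) -> bool:
    """True when G(a) is empty."""
    return not generalizations(a)


def is_leaf(b: str) -> bool:
    """True when S(b) is empty."""
    return not specializations(b)
```

In this toy tree, G(Mouse) contains Animals and Organisms, while S(Animals) contains Mouse and Human.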



2.2 Elementary Transformations of Feature Multisets 

Each document is represented as a multiset $X \subseteq F$ of features extracted from the
document. Before $X$ is passed to a text classifier, it undergoes a transformation,
which may remove some features from the multiset, or replace some features with
a multiset of (possibly different) features. The feature multiset transformations
are independent of the classification method used for the data, since the classifier
is applied only after the features have been transformed.

In order to search through the space of possible feature multiset transformations,
we define a set of elementary transformations, where each elementary
transformation is a feature multiset mapping $2^F \to 2^F$. A locally optimal
transformation is obtained as a composition of several elementary transformations.
In the following, we consider some of the possible elementary transformations.

Generalization to a Feature. The generalization to a feature transformation
is parametrized by a feature $a \in F$ and it causes all features belonging to $S(a)$
(that is, features more specific than $a$) to be replaced by the feature $a$; in other
words, the whole subtree under the feature $a$ is “folded” up to the feature $a$.



³ http://www.nlm.nih.gov/mesh/meshhome.html
The transformation causes all features belonging to $S(a)$ to be treated as full
synonyms of $a$:

$$G_a(X) = \bigcup_{x \in X} g_a(x), \quad \text{where} \tag{1}$$

$$g_a(x) = \begin{cases} \{a\} & \text{if } x \prec^* a \\ \{x\} & \text{otherwise.} \end{cases} \tag{2}$$
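Equations (1) and (2) can be sketched as follows. The toy hierarchy and the function names are hypothetical, and the sketch assumes that the multiset union adds multiplicities, which the text does not state explicitly.

```python
from collections import Counter
from typing import Dict, Optional

# Hypothetical toy hierarchy stored as a child-to-parent map.
PARENT: Dict[str, Optional[str]] = {
    "Organisms": None,
    "Animals": "Organisms",
    "Bacteria": "Organisms",
    "Mouse": "Animals",
    "Human": "Animals",
    "E. coli": "Bacteria",
}


def is_specialization(x: str, a: str, parent: Dict[str, Optional[str]]) -> bool:
    """True when x is a proper descendant of a, i.e. x is in S(a)."""
    node = parent[x]
    while node is not None:
        if node == a:
            return True
        node = parent[node]
    return False


def generalize_to_feature(X: Counter, a: str,
                          parent: Dict[str, Optional[str]]) -> Counter:
    """G_a: replace every feature in S(a) by a, keeping all other features."""
    out: Counter = Counter()
    for x, count in X.items():
        target = a if is_specialization(x, a, parent) else x
        out[target] += count  # assumption: multiplicities are summed
    return out
```

For example, applying the sketch with a = Animals to a document containing Mouse twice, Human once, and E. coli once yields Animals three times and E. coli once.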



Generalization to a Level. The generalization to a level transformation is
parametrized by a level of generality $n \in \mathbb{N}$. The level $n_x$ is defined inductively
for all $x \in F$ as follows. Let $n_x = 0$ for the most general feature $x$. Let $a, b \in F$.
Then $n_b = n_a + 1$ for all $b$ such that $b \prec a$. The transformation causes all features
$x \in F$ with level of generality $n_x > n$ to be mapped to their generalization $a \in F$
such that $n_a = n$. This transformation is achieved by the mapping

$$L_n(X) = \bigcup_{x \in X} l_n(x), \quad \text{where} \tag{3}$$

$$l_n(x) = \begin{cases} \{a\},\ n_a = n,\ x \prec^* a & \text{if } n_x > n \\ \{x\} & \text{otherwise.} \end{cases} \tag{4}$$

The transformation is closely related to a concept introduced by Scott and
Matwin ([5]). Note that every $L_n$-transformation can be performed as a
composition of $G_a$-transformations for all features $a$ such that $n_a = n$. However,
the expressive power of the $G_a$-transformation is greater than that of the
$L_n$-transformation.
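The level-based mapping can be sketched in the same style. The toy hierarchy and function names are hypothetical; as before, multiplicities are assumed to add up.

```python
from collections import Counter
from typing import Dict, Optional

# Hypothetical toy hierarchy stored as a child-to-parent map.
PARENT: Dict[str, Optional[str]] = {
    "Organisms": None,
    "Animals": "Organisms",
    "Bacteria": "Organisms",
    "Mouse": "Animals",
    "Human": "Animals",
    "E. coli": "Bacteria",
}


def level(x: str, parent: Dict[str, Optional[str]]) -> int:
    """n_x: 0 for the root, parent's level plus one otherwise."""
    n = 0
    node = parent[x]
    while node is not None:
        n += 1
        node = parent[node]
    return n


def generalize_to_level(X: Counter, n: int,
                        parent: Dict[str, Optional[str]]) -> Counter:
    """L_n: map every feature deeper than level n to its ancestor at level n."""
    out: Counter = Counter()
    for x, count in X.items():
        node = x
        # Walk up exactly as many edges as x lies below level n.
        for _ in range(max(0, level(x, parent) - n)):
            node = parent[node]
        out[node] += count
    return out
```

Features whose level is at most n are left untouched, which corresponds to the "otherwise" branch of the mapping.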



Omitting a Feature. The transformation causes the feature $a$, which is the
parameter of this transformation, to be omitted from the feature multiset,
including all specializations of $a$:

$$O_a(X) = \bigcup_{x \in X} o_a(x), \quad \text{where} \tag{5}$$

$$o_a(x) = \begin{cases} \emptyset & \text{if } x \in S(a) \cup \{a\} \\ \{x\} & \text{otherwise.} \end{cases} \tag{6}$$

The use of this transformation is related to the “wrapper” approach to feature
selection (John et al., [6]), where a set of relevant features is chosen iteratively,
by greedily adding or removing a single feature until a significant decrease in the
classification performance is observed. This greedy algorithm yields a locally
minimal set of features that maintain the classification performance of the full
set of features.
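A corresponding sketch of the omission mapping, again on a hypothetical toy hierarchy with hypothetical function names:

```python
from collections import Counter
from typing import Dict, Optional

# Hypothetical toy hierarchy stored as a child-to-parent map.
PARENT: Dict[str, Optional[str]] = {
    "Organisms": None,
    "Animals": "Organisms",
    "Bacteria": "Organisms",
    "Mouse": "Animals",
    "Human": "Animals",
    "E. coli": "Bacteria",
}


def is_specialization(x: str, a: str, parent: Dict[str, Optional[str]]) -> bool:
    """True when x is a proper descendant of a, i.e. x is in S(a)."""
    node = parent[x]
    while node is not None:
        if node == a:
            return True
        node = parent[node]
    return False


def omit_feature(X: Counter, a: str,
                 parent: Dict[str, Optional[str]]) -> Counter:
    """O_a: drop a itself and every feature in S(a) from the multiset."""
    return Counter({
        x: count
        for x, count in X.items()
        if x != a and not is_specialization(x, a, parent)
    })
```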



2.3 The Algorithm 

We apply here a greedy approach to search for the locally optimal transformation.
The algorithm assumes the existence of a target function $E: \mathcal{M} \to \mathbb{R}$,
where $\mathcal{M}$ is the set of all possible feature mappings. The function evaluates
the goodness of feature mappings $M \in \mathcal{M}$ and can be used to compare two
mappings with respect to a criterion represented by the function $E$. Let further
$A \subset \mathcal{M}$ be the set of all defined elementary transformations with all possible
parameter combinations, let $\theta \in \mathbb{R}$ be a small threshold constant, and let $I \in \mathcal{M}$
be the identity mapping. The greedy algorithm that returns a locally optimal
transformation is presented in Figure 1.

input: A, E

output: a mapping M ∈ ℳ
i ← 0, M_0 ← I, Γ ← A
while |Γ| > 0
    i ← i + 1
    M* ← argmax_{M ∈ Γ} E(M ∘ M_{i-1})
    M_i ← M* ∘ M_{i-1}
    if E(M_i) − E(M_{i-1}) < θ then
        return M_{i-1}
    end if
end while
return M_i

Fig. 1. A greedy algorithm to compute a locally optimal transformation as the com- 
position of several elementary transformations drawn from the set A 
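The greedy search of Figure 1 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: mappings are modelled as plain functions, all names are hypothetical, and the chosen transformation is removed from the candidate set after each round, which the figure does not show explicitly.

```python
from typing import Callable, List, Sequence

Mapping = Callable[[frozenset], frozenset]


def compose(outer: Mapping, inner: Mapping) -> Mapping:
    """Return the mapping that applies inner first and outer second."""
    return lambda X: outer(inner(X))


def greedy_search(elementary: Sequence[Mapping],
                  evaluate: Callable[[Mapping], float],
                  theta: float = 1e-4) -> Mapping:
    """Compose elementary transformations while E improves by at least theta."""
    current: Mapping = lambda X: X  # identity mapping I
    current_score = evaluate(current)
    remaining: List[Mapping] = list(elementary)
    while remaining:
        # Score every candidate composed with the mapping found so far.
        scored = [(evaluate(compose(t, current)), idx)
                  for idx, t in enumerate(remaining)]
        best_score, best_idx = max(scored)
        if best_score - current_score < theta:
            return current  # gain too small: keep the previous mapping
        # Assumption: a chosen transformation is not considered again.
        current = compose(remaining.pop(best_idx), current)
        current_score = best_score
    return current
```

Each round evaluates every remaining candidate once, so the cost of one round is proportional to the size of the candidate set times the cost of evaluating E.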

Let us consider the task introduced in Section 1, i.e., classification
between documents about eucaryotes and procaryotes. Let $F$ be the terms of
the MeSH hierarchy and $E$ be a function that evaluates how well a feature
mapping $M$ helps some classifier to separate the two classes. The set $A$ contains
all defined elementary transformations, that is, $A$ contains all transformations
$G_x$ that generalize up to a MeSH term $x$, all transformations $O_x$ that omit a
MeSH term $x$, and all transformations $L_n$, $n = 1, \dots, N$, that generalize to a
level $n$, where $N$ is the depth of the MeSH hierarchy tree.

The set $A$ contains, among others, the transformations $G_{\mathrm{Organisms}}$,
$G_{\mathrm{Animals}}$, and $G_{\mathrm{Bacteria}}$. The transformation $G_{\mathrm{Organisms}}$ is obviously harmful,
as it suppresses the distinction between eucaryotes and procaryotes. The other
two transformations are probably beneficial for any classifier, given the eucaryote
vs. procaryote classification problem, since there is no need to distinguish
between individual members of the Bacteria or Animals groups: all animals are
eucaryotes and all bacteria are procaryotes. The resulting transformation could, for
example, be $G_{\mathrm{Animals}} \circ G_{\mathrm{Bacteria}} \circ G_{\mathrm{Plants}} \circ G_{\mathrm{Fungi}} \circ \cdots$, which combines
all the various direct specializations of Organisms, but never combines all the
organisms.

The transformation in this example will affect the classification in at least
two ways. Every feature that is a member of, for example, the Animals subtree
is replaced with the feature Animals. Considering, for example, the
maximum-likelihood probability estimate of the naive Bayes classifier, the replacement
alleviates the data sparseness problem, because the classifier no longer needs to
estimate the class-wise probabilities separately for every individual Animals
feature. Further, when a document instance is being classified and its feature
multiset contains some animal feature, but the particular animal was not
encountered during the training of the classifier, the unknown feature can be used
in the classification, because due to the transformation $G_{\mathrm{Animals}}$ it “inherits”
the class-wise characteristics of the feature Animals.

2.4 Evaluation Function E 

The greedy algorithm introduced in Section 2.3 assumes an evaluation function $E$
which can be used to evaluate how well a mapping fulfills the criterion represented
by the function $E$. Here we define the function $E$ in terms of cross-validated
classification performance of a classifier using the mapped features on some text
classification problem.

Let $R$ be a set of labeled training examples, and let $r \in R$ be a training
example. Let further $X_r \subseteq F$ be a multiset of features associated with the
example $r$. Let us assume a classifier $C: 2^F \to \mathbb{N}$ that assigns a class label to the
instance $r$, based on its associated feature multiset $X_r$. Then, for each feature
transformation mapping $M$, we can define $E(M)$ to be, for example, the average
accuracy of $C$ when performing a 10-fold cross-validation experiment using the
set $R$. For each instance $r$ and its associated feature multiset $X_r$, the class is
computed as $C(M(X_r))$.

3 Application to a Document Classification Problem 

We apply the method to the problem of classifying MEDLINE-indexed docu- 
ments. The set of transformations A is thus instantiated on the MeSH ontology. 
The evaluation function E is defined in terms of the naive Bayes classifier. 

3.1 The MeSH Ontology 

The MeSH (Medical Subject Headings) is the National Library of Medicine’s
(NLM) controlled vocabulary of medical and biological terms. MeSH terms are
organized in a hierarchy that contains the most general terms at the top and
the most specific terms at the bottom. There are 21,973 main headings, termed
descriptors, in the MeSH.

Publications in the MEDLINE database are manually indexed by NLM us- 
ing MeSH terms, with typically 10-12 descriptors assigned to each publication. 
Hence, the MeSH annotation defines for each publication a highly descriptive set 
of features. Of the over 7 million MEDLINE publications that contain abstracts, 
more than 96% are currently indexed. 
3.2 Feature Extraction from MeSH-Annotated MEDLINE Documents

An occurrence of a term in the MeSH hierarchy is not unique: a term may appear
more than once in the hierarchy, as a member of different subtrees. The MEDLINE
documents are annotated using MeSH terms, rather than their unique
subtree numbers, and thus it is not possible to distinguish between the possible
instances of the term in the MeSH hierarchy. We separate the ambiguous term
occurrences by renaming them, so that each occurrence receives a distinct name.
When extracting the features (MeSH terms) of a MEDLINE document, we include all
possible instances of the ambiguous term occurrence. Thus, a document annotated
with a MeSH term that has two occurrences in the hierarchy will be represented
as having two features, one for each renamed occurrence. In the following, we
will consider the MeSH hierarchy where all ambiguous occurrences of terms have
been renamed and thus a term occurrence in this modified MeSH hierarchy is
unique. The modified MeSH hierarchy contains 39,853 descriptors. Since the MeSH
hierarchy consists of 15 separate trees, we also introduce a single root for the
hierarchy.
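The extraction step can be sketched as follows. The term names, the position codes, and the `term@position` naming scheme are all hypothetical and chosen only for illustration; the renaming convention actually used is not specified here.

```python
from typing import Dict, List

# Hypothetical lookup: MeSH term -> positions (subtree numbers) where it occurs.
OCCURRENCES: Dict[str, List[str]] = {
    "TermA": ["A01.1", "B02.3"],  # ambiguous: occurs in two subtrees
    "TermB": ["C03"],             # unambiguous
}


def extract_features(annotated_terms: List[str],
                     occurrences: Dict[str, List[str]]) -> List[str]:
    """Expand each annotated term into one feature per hierarchy occurrence."""
    features: List[str] = []
    for term in annotated_terms:
        positions = occurrences[term]
        if len(positions) == 1:
            features.append(term)
        else:
            # Assumed renaming scheme: suffix the term with its position.
            features.extend(f"{term}@{pos}" for pos in positions)
    return features
```

A document annotated with the ambiguous TermA is thus represented by two features, one per subtree in which the term occurs.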

3.3 Experimental Setup 

In the empirical evaluation of the method, we consider the following classifica- 
tion problem. We randomly select 10 journals that have at least 2000 documents 
indexed in the MEDLINE database. For each of these 10 journals, 2000 random 
documents were retrieved from MEDLINE. The classification task is to assign a 
document to the correct journal, that is, to the journal in which the document 
was published. The 10 journals form 10 classification datasets, each having 2000 
positive examples and 18000 negative examples formed by the documents be- 
longing to the other 9 journals. The proportion of positive and negative examples 
is thus 1:9 in each of the datasets. 

From the possible elementary transformations presented in Section 2.2, we
only consider the generalization to a feature, since the
transformation that omits a feature is closely related to a standard and
well-researched feature-selection technique. The generalization up to a level was tested
in our early experiments, but it proved to be clearly less effective than the
generalization to a feature transformation. This is in agreement with the findings
of Scott and Matwin ([5]). However, note that the MeSH hierarchy requires
10 $L$-transformations only, whereas up to 11,335 $G$-transformations need to be
evaluated in each step of the greedy algorithm.⁴ Adopting the generalization to
a feature transformations thus increases the computational requirements
significantly.

⁴ The modified MeSH tree has depth 10 and 11,335 non-leaf nodes.

We use the area under the precision-recall curve (AUC) induced by a leave-one-out
cross-validation experiment using the naive Bayes classifier as the value
of the evaluation function $E$. The area under the precision-recall curve is the
average precision over the whole recall range. The AUC is directly related to
the well-known 11-point average precision (see, e.g., Witten and Frank ([7])). To
avoid the variance at the extremities of the curve, we use a trimmed AUC only
considering the area from 10% recall to 90% recall. We construct the curve by
ordering the classified documents in descending order by the positive vs. negative
class probability ratio of the naive Bayes classifier and computing the precision
and recall values at each of the documents. An important property of the naive
Bayes classifier is that it allows implementation of an $O(n)$ complexity leave-one-out
cross-validation. A fast implementation of the leave-one-out cross-validation
is necessary, since it is performed in each round of the greedy algorithm for each
possible elementary transformation. We chose the leave-one-out cross-validation
scheme to ensure high stability of the function $E$, avoiding the variance induced
by the random dataset split in, for example, 10-fold cross-validation. Since most
of the individual elementary transformations have only a very small effect on the
performance, it is important to obtain an accurate and stable measure of the
performance in order to distinguish even small gain from noise. The stopping
criterion parameter $\theta$ is set to $\theta = 10^{-4}$.
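The trimmed measure can be approximated as sketched below. The exact integration scheme is not specified in the text, so this sketch makes an assumption: it averages the precision at every rank where a positive document is retrieved and the recall lies inside the window. All names are hypothetical.

```python
from typing import List, Sequence


def trimmed_average_precision(scores: Sequence[float],
                              labels: Sequence[int],
                              lo: float = 0.1,
                              hi: float = 0.9) -> float:
    """Average precision restricted to recall values in [lo, hi]."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive example is required")
    # Rank documents by descending score (e.g. the class probability ratio).
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    precisions: List[float] = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_pos += 1
            recall = true_pos / total_pos
            if lo <= recall <= hi:
                precisions.append(true_pos / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

Setting `lo=0.0` and `hi=1.0` gives the untrimmed variant of the same quantity.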

We cross-validate the results for each of the 10 journal datasets separately, 
using the 5x2cv cross-validation methodology introduced by Dietterich ([8]). The 
5x2cv test performs five replications of a 2-fold cross-validation. In each fold, 
we use the training set data to build the transformation mapping, as described 
in Section 2.4, and then, using the transformation mapping and the training set 
data, we estimate the performance of the classifier on the test set data. The test 
set data is not used during the search for the mapping nor during the training 
of the classifier. Each of the 5 cross-validation replications consists of two folds. 
For each fold we measure the standard untrimmed AUC, unlike in the case of 
the function E, and then average the AUC of the two folds. The performance of 
the 5 replications is then averaged to obtain the final cross-validated measure of
the classification performance for one dataset. To test for statistical significance,
we use the robust 5x2cv test (Alpaydın, [9]), since the standard t-test would give
misleading results due to the dependency problem of cross-validation.
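Assuming that the test in question is Alpaydın's combined 5x2cv F test, its statistic can be sketched as follows. The function name is hypothetical. The statistic is approximately F-distributed with 10 and 5 degrees of freedom; the p-value lookup is omitted here.

```python
from typing import Optional, Sequence, Tuple


def combined_5x2cv_f(diffs: Sequence[Tuple[float, float]]) -> Optional[float]:
    """F statistic from five (fold 1, fold 2) performance differences.

    Each pair holds the difference between the two compared methods in the
    two folds of one replication. Returns None when the variance estimate is
    zero, in which case the statistic is undefined.
    """
    if len(diffs) != 5:
        raise ValueError("exactly five replications are expected")
    numerator = sum(p1 * p1 + p2 * p2 for p1, p2 in diffs)
    variance_sum = 0.0
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2.0
        variance_sum += (p1 - mean) ** 2 + (p2 - mean) ** 2
    if variance_sum == 0.0:
        return None
    return numerator / (2.0 * variance_sum)
```

When both methods perform identically in every fold, all differences are zero and no statistic can be computed.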

As the baseline, we use the naive Bayes classifier with no feature transformation
applied. In both cases the method introduced by Ng ([10]) was used
to smooth the maximum-likelihood estimate of the probabilities for the naive
Bayes classifier. Ng’s method is commonly applied in text classification
tasks and it does not interfere with the $O(n)$ implementation of leave-one-out
cross-validation for the naive Bayes classifier.
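The idea behind a linear-time leave-one-out procedure for naive Bayes can be sketched as follows: the count statistics are collected once, and each document is scored after temporarily subtracting its own counts. This is only an illustration with hypothetical names; it uses a multinomial model with add-one smoothing as a stand-in, not Ng's smoothing method.

```python
import math
from collections import Counter


def count_statistics(docs):
    """Accumulate per-class feature counts, token totals and document counts."""
    feature_counts = {0: Counter(), 1: Counter()}
    token_totals = {0: 0, 1: 0}
    doc_counts = {0: 0, 1: 0}
    for features, label in docs:
        feature_counts[label].update(features)
        token_totals[label] += sum(features.values())
        doc_counts[label] += 1
    return feature_counts, token_totals, doc_counts


def log_odds(features, stats, vocab_size):
    """Positive vs. negative log probability ratio (add-one smoothed)."""
    feature_counts, token_totals, doc_counts = stats
    score = math.log(doc_counts[1] + 1) - math.log(doc_counts[0] + 1)
    for feature, n in features.items():
        p_pos = (feature_counts[1][feature] + 1) / (token_totals[1] + vocab_size)
        p_neg = (feature_counts[0][feature] + 1) / (token_totals[0] + vocab_size)
        score += n * (math.log(p_pos) - math.log(p_neg))
    return score


def leave_one_out_scores(docs, vocab_size):
    """Score every document with its own counts temporarily removed."""
    stats = count_statistics(docs)
    feature_counts, token_totals, doc_counts = stats
    scores = []
    for features, label in docs:
        size = sum(features.values())
        # Remove the document from the statistics of its own class ...
        feature_counts[label].subtract(features)
        token_totals[label] -= size
        doc_counts[label] -= 1
        scores.append(log_odds(features, stats, vocab_size))
        # ... and restore the statistics before moving on.
        feature_counts[label].update(features)
        token_totals[label] += size
        doc_counts[label] += 1
    return scores
```

Each document is touched a constant number of times, so the total cost is linear in the number of feature occurrences, and the resulting scores match those obtained by retraining without the held-out document.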

3.4 Empirical Results 

The results for the 10 datasets are presented in Table 1. 

For 6 datasets, the transformed feature hierarchy results in a statistically
significant ($p < 0.05$) increase of the classification performance. Note that for two
of the other datasets (datasets 2 and 5) the baseline performance is very close
to 100%, leaving little room for significant improvement. For dataset 2, the
transformed feature hierarchy results in a negligible decrease of the classification
performance.
Table 1. The classification performance of the naive Bayes classifier. First, the
untrimmed AUC percentages are given for the baseline and transformed features. The
column denoted Δ is the improvement over the baseline. The column p is the p-value
of the 5x2cv statistical significance test. Statistical significance (p < 0.05) is denoted
by an asterisk. The column rnds is the average number of elementary transformations
applied by the greedy algorithm, and the column TS is the average size of the
transformed MeSH tree as a percentage of the original size of 39,853 nodes (see Section 3.4
for discussion)

                                AUC [%]
ID  Journal                 Baseline  Transf.   Δ [%]   p        rnds   TS [%]
 1  ActaAnatBasel            87.15    88.05     0.90    0.043*    9.0    76.5
 2  ApplEnvironMicrobiol     98.28    98.26    -0.02    0.535     0.2    99.7
 3  BiolPsychiatry           95.14    95.70     0.56    0.001*    5.0    80.3
 4  EurJObstetGynecol.       91.21    92.31     1.10    0.006*    8.7    73.0
 5  FedRegist                99.48    99.48     0.00    undef.    0.0   100.0
 6  JPathol                  81.71    82.94     1.23    0.003*   13.3    84.2
 7  NipponRinsho             65.41    67.24     1.83    0.017*   30.4    75.6
 8  PresseMed                51.06    51.38     0.32    0.503    31.4    79.3
 9  SchweizRundschMedPrax    58.95    61.53     2.58    0.029*   25.8    68.4
10  ToxicolLett              88.93    89.12     0.19    0.403     5.5    92.0



Depending primarily on the number of transformations taken, the processing 
time varies from 3 minutes (no transformations taken) to 1 hour 45 minutes (44 
transformations taken) for each fold, using a 2.8GHz processor. 

The $G$-transformation used in the experimental evaluation can also be considered
in terms of dimensionality reduction, since a $G_a$-transformation causes
all features in $S(a)$ to be replaced with $a$, hence the classifier never encounters
any feature $f \in S(a)$. The column TS of Table 1 presents the size of the tree
when the features $f$ are considered as removed. A reduction to about 80% of the
tree size can be typically observed.
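The tree-size figure can be computed as sketched below, on a hypothetical toy hierarchy with hypothetical function names: every node lying below one of the folded features is counted as removed.

```python
from typing import Dict, Iterable, Optional

# Hypothetical toy hierarchy stored as a child-to-parent map.
PARENT: Dict[str, Optional[str]] = {
    "Organisms": None,
    "Animals": "Organisms",
    "Bacteria": "Organisms",
    "Mouse": "Animals",
    "Human": "Animals",
    "E. coli": "Bacteria",
}


def is_specialization(x: str, a: str, parent: Dict[str, Optional[str]]) -> bool:
    """True when x is a proper descendant of a, i.e. x is in S(a)."""
    node = parent[x]
    while node is not None:
        if node == a:
            return True
        node = parent[node]
    return False


def transformed_tree_size(parent: Dict[str, Optional[str]],
                          folded: Iterable[str]) -> float:
    """Percentage of nodes that remain after folding the given features."""
    folded = list(folded)
    removed = {
        f for f in parent
        if any(is_specialization(f, a, parent) for a in folded)
    }
    return 100.0 * (len(parent) - len(removed)) / len(parent)
```

Folding Animals in the six-node toy tree removes Mouse and Human, leaving about 66.7% of the nodes.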

4 Discussion 

To study the effect of dataset size on the performance of the method, we repeated 
the experiment for several smaller datasets. We observed, however, no systematic 
behavior of the 10 datasets with respect to dataset size. 

Figure 2 demonstrates a rather good outcome of a classification experiment⁵
for a single dataset, where the precision of the classifier with the transformed
features is higher than the baseline over the whole recall range. In the experiments,
however, it was often the case that the two curves crossed in at least
one point, meaning that the transformed features increase the precision only on
certain intervals of the recall range, while on other intervals the precision
decreases. In such a case, the AUC values of the two curves are roughly similar,
yielding a more conservative estimate of the performance than the accuracy at
any single point on the curve. A full evaluation of the performance of the method
for a single dataset thus ideally requires an analysis of the full precision-recall
characteristic of the classifier.

⁵ The curves represent a real experiment, however.




Fig. 2. An example of precision-recall curves of a classifier with transformed and
untransformed features. The data is smoothed with a Bézier curve

The proposed method assumes that the features are organized in some hierar- 
chical ontology. This, however, is not a prohibitive restriction, since any natural 
language text can be mapped to a general ontology such as WordNet. There 
also exists a biomedical ontology, the UMLS, which is provided with a tool for 
mapping unannotated text to the ontology, the MetaMap program. Such a map- 
ping involves several issues, such as ambiguities in the text or the ontology, that 
can introduce errors into the feature extraction process. A further evaluation
of the method on classification tasks involving features obtained by automatic
mapping of free text to an ontology is thus necessary.

The G-transformation used in the empirical evaluation of the method can be 
viewed as a binary semantic similarity measure, that is, two words can be either 
synonymous or semantically unrelated. This binary approach can be seen as a 
special case of a weighted approach, where the weights expressing the strength 
of a relationship between two words are 1 or 0. Future research can thus be
directed to devising methods that compute a finer-grained similarity representation
tailored to the problem at hand.

5 Conclusions 

In this paper we present a novel approach to incorporating semantic information
into the problems of natural language processing, in particular the document
classification task. We devise a theoretical framework in which the semantic
information is incorporated in the form of ontology-based feature transformations.
We introduce several elementary transformations and an algorithm that
identifies a beneficial transformation as a composition of elementary transformations.
In order to obtain a feature transformation that is optimized both for
the data and the classification method used, we define an evaluation function $E$
that directs the greedy search in terms of the same classification method that is
applied to the classification task. This is analogous to the wrapper approach of
John et al. ([6]).

To test the method empirically, we apply it to a classification problem on 
MeSH-annotated documents. The empirical results show that the method is 
capable of statistically significant improvement of performance in 6 out of 10 
datasets. In two datasets the improvement was not statistically significant, and 
for the remaining two datasets no significant improvement can be expected due 
to the very high baseline performance. 

While the results indicate that the presented greedy algorithm is sufficient 
to validate the concept of feature transformations, it must repeatedly evaluate 
a potentially large number of elementary transformations, which makes it com- 
putationally expensive relative to the baseline method. Further research should 
thus be directed to devise better search strategies that result in a more efficient 
algorithm. For example, the search space could be reduced by exploiting the fact 
that the features are organized in a hierarchy. The search should also attempt to 
avoid stopping in local optima. Various forms of elementary transformations and 
evaluation functions E can also be studied. The results show that some datasets 
benefit more from the method than others. Further work should therefore be 
directed to study the properties of the data that determine whether a beneficial 
transformation can be found and how big an improvement can be achieved for 
the given dataset. 
Abstract. Word Sense Disambiguation (WSD) systems are usually evaluated
by comparing their absolute performance, in a fixed experimental
setting, to other alternative algorithms and methods. However, little attention
has been paid to analyzing the lexical resources and the corpora
defining the experimental settings and their possible interactions with
the overall results obtained. In this paper we present some experiments
supporting the hypothesis that the quality of lexical resources used for 
tagging the training corpora of WSD systems partly determines the qual- 
ity of the results. In order to verify this initial hypothesis we have de- 
veloped two kinds of experiments. At the linguistic level, we have tested 
the quality of lexical resources in terms of the annotators’ agreement 
degree. From the computational point of view, we have evaluated how 
those different lexical resources affect the accuracy of the resulting WSD 
classifiers. We have carried out these experiments using three different 
lexical resources as sense inventories and a fixed WSD system based on 
Support Vector Machines. 



1 Introduction 

Natural Language Processing applications have to face ambiguity resolution
problems at many levels of the linguistic processing. Among them, semantic (or
lexical) ambiguity resolution is a currently open challenge, whose solution would be
potentially very beneficial for many NLP applications, e.g., Machine Translation
and Information Extraction/Retrieval systems [1].

The goal of WSD systems is to assign the correct semantic interpretation 
to each word in a text, which basically implies the automatic identification of 
its sense. In order to be able to address the WSD task, electronic dictionaries 
and lexicons, and semantically tagged corpora are needed. We assume that these 
linguistic resources are fundamental to successfully carry out WSD. 

One of the approaches to WSD is the supervised one, in which statistical or Machine
Learning (ML) techniques are applied to automatically induce, from semantically
annotated corpora, a classification model for sense disambiguation.
This approach is typically contrasted with the knowledge-based approach (also
sometimes referred to as unsupervised¹), in which some external knowledge sources
(e.g., WordNet, dictionaries, parallel corpora, etc.) are used to devise some
heuristic rules to perform sense disambiguation, avoiding the use of a manually
annotated corpus. Despite the appeal of the unsupervised approach, it has been
observed through a substantial body of comparative work, carried out mainly
in the Senseval exercises², that the ML-based supervised techniques tend to
outperform the knowledge-based approach when enough training
examples are available. In this paper we will concentrate on the quality of the
resources needed to train supervised systems.

We consider that there are two critical points in the supervised WSD process
which have been neglected, and which are decisive if good results are to be
achieved: first, the quality of the lexical sources and, second, the quality of the
manually tagged corpora. Moreover, the quality of these corpora is determined,
to a large extent, by the quality of the lexical source used to carry out the
tagging process. Our research has focused both on the evaluation of three different
lexical sources: Diccionario de la Real Academia Española (DRAE, [2]),
MiniDir (MD, [3]), and Spanish WordNet (SWN, [4]), and on how these resources
determine the results of the machine learning-based methods for word
sense disambiguation.

The methodology followed for the evaluation of the lexical sources is based
on the parallel tagging of a single corpus by three different annotators for each
lexical source. The annotators’ agreement degree will be used for measuring
the lexical source quality: the higher the agreement, the higher the quality of the
source. Thus, a high agreement would indicate that the senses in the
lexical source are clearly defined and have a wide coverage. This methodology
guarantees objectivity in the treatment of senses.
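As an illustration of how an agreement degree among annotators might be quantified, the following sketch computes the average pairwise agreement over aligned tag sequences. The function name and the choice of this particular measure are assumptions made for illustration only; other measures, such as chance-corrected ones, are equally possible.

```python
from itertools import combinations
from typing import Hashable, Sequence


def pairwise_agreement(annotations: Sequence[Sequence[Hashable]]) -> float:
    """Fraction of (annotator pair, occurrence) decisions that coincide.

    annotations[k][i] is the sense tag assigned by annotator k to the
    i-th occurrence; all annotators tag the same occurrences in order.
    """
    if len(annotations) < 2:
        raise ValueError("at least two annotators are required")
    pairs = list(combinations(annotations, 2))
    agreeing = sum(
        tag_a == tag_b
        for first, second in pairs
        for tag_a, tag_b in zip(first, second)
    )
    total = sum(len(first) for first, _ in pairs)
    return agreeing / total
```

With three annotators, each occurrence contributes three pairwise comparisons, so full agreement on every occurrence gives a value of 1.0.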

For measuring the influence of lexical sources in supervised WSD systems,
we trained and tested a system based on Support Vector Machines (SVM, [5, 6])
using each of the lexical resources. Results are compared both straightforwardly
and after a sense clustering process which is intended to compensate for the
advantage of disambiguating against a fine-grained resource such as the WordNet
lexical database or the DRAE dictionary.

The rest of the paper is divided into two main parts. The first one is devoted
to the analysis of the quality of lexical sources (section 2) and the second one
aims at testing whether the best results in the first phase correlate with the best
results obtained by the supervised word sense disambiguation system (section 3).
Finally, in section 4 we present the main conclusions drawn and some lines for
future work.

¹ This term is rather confusing since in machine learning terminology, unsupervised
refers to a learning scenario from unannotated examples (in which the class labels
are omitted). In that case, the goal is to induce clusters of examples, representing
the underlying classes.

² Senseval is a series of evaluation exercises for Word Sense Disambiguation organized
by the ACL-SIGLEX. See http://www.senseval.org for more information.



2 Lexical Resources Evaluation 

Several authors have carried out studies with the aim of proposing specific models
and methodologies for the elaboration of lexical sources oriented to WSD
tasks. A particularly notable proposal is that of Véronis [7], in which the validity of
traditional lexical representation of senses is questioned. This author proposes a
model of lexical source suitable for WSD based mainly on syntactic criteria.
Kilgarriff [8] carried out an experiment on semantic tagging, with the aim of defining
the upper bound in manual tagging. In that paper, the upper bound was established
at 95% of annotators’ agreement. Krishnamurthy and Nichols [9] analyze
the process of the gold-standard corpus tagging for Senseval-2, highlighting the
most common inconsistencies of dictionaries: incorrect sense division, definition
errors, etc. Fellbaum et al. [10] analyze the process of semantic tagging with a
lexical resource such as WordNet, but focusing on those features they consider
as a source of difficulty: the lexical category, the order of the senses in the lexical
source, and the annotators’ profile. All the authors highlight the importance of
the lexical source as an essential factor in order to obtain quality results. The
aim of our research has been to evaluate the quality of lexical resources and
test its influence on the quality of results of WSD based on machine learning
techniques.

The methodology followed in this work for the evaluation of the lexical re- 
sources consists in the manual semantic tagging of a single corpus with three 
different lexical sources: DRAE, MiniDir, and Spanish WordNet. The tagging 
process has been carried out by different annotators. This methodology allows 
us to analyze comparatively the results obtained for each of the lexical sources 
and, therefore, to determine which of them is the most suitable for WSD tasks. 
Our starting point is the hypothesis that the annotator agreement degree is
proportional to the quality level of the lexical resource: the higher the agreement,
the higher the quality of the lexical source.

The evaluated lexical sources present very different characteristics and have
been selected for different reasons. Firstly, we have used the Diccionario de
la Real Academia Española (DRAE), as it is the reference and normative
dictionary of the Spanish language. Secondly, MiniDir-2.1 is a lexicon designed
specifically for automatic WSD. This lexical source contains a limited number
of entries (50), which have been elaborated specifically as a resource for the
Senseval-3 Spanish Lexical Sample Task³. Finally, we have also used Spanish
WordNet as sense repository, since WordNet is one of the most used lexical
resources for WSD.

We have performed all the evaluation and comparative experiments using the
following subset of ten lexical entries (the most common translations into
English are given in parentheses). Four nouns: columna (column), corazón
(heart), letra (letter), and pasaje (passage). Two adjectives: ciego (blind)
and natural (natural). Four verbs: apoyar (to lean/rest; to rely on), apuntar
(to point/aim; to indicate; to make a note), explotar (to exploit; to explode),
and volar (to fly; to blow up). See more information on these words in table 2.

³ See www.lsi.upc.es/~nlp/senseval-3/Spanish.html for more information.

SOURCE:MiniDir-2.1; LEMMA:columna; POS:ncmfs; SENSE:1; DEFINITION:figura
arquitectónica de forma cilíndrica que sirve como soporte o elemento
decorativo; EXAMPLE:una gran columna de hormigón; una antigua columna del
tiempo de los romanos; SYNONYMS:manejar; COLLOCATIONS:columna_corintia,
columna_de_bronce, columna_de_mármol, columna_de_piedra, columna_dórica,
columna_griega, columna_jónica; SYNSETS:02326166n/02326665n/02881716n; DRAE:1

SOURCE:MiniDir-2.1; LEMMA:columna; POS:ncmfs; SENSE:4; DEFINITION:forma
cilíndrica que toman algunos fluidos o gases cuando ascienden o cuando están
contenidos en un cilindro; EXAMPLE:una densa columna de humo; SYNONYMS:?;
COLLOCATIONS:columna_de_agua, columna_de_humo; SYNSETS:08508248n; DRAE:3/5

Fig. 1. Example of two MiniDir-2.1 lexical entries for columna

2.1 The Lexical Sources 

In the development of MiniDir-2.1 we have basically taken into account
information extracted from corpora. We have used corpora from two newspapers,
with a total of 3.5 million and 12.5 million words, respectively, and also
Lexesp [11]. The latter is a balanced corpus of 5.5 million words, which
includes texts on different topics (science, economics, justice, literature,
etc.), written in different styles (essay, novel, etc.) and different language
registers (standard, technical, etc.). The corpora provide quantitative and
qualitative information which is essential to differentiate senses and to
determine the degree of lexicalization. As regards the information in the
entries of the dictionary, every sense is organized into the following nine
lexical fields: LEMMA, POS CATEGORY⁴, SENSE, DEFINITION, EXAMPLE, SYNONYMS
(plus ANTONYMS in the case of adjectives), COLLOCATIONS, SYNSETS, DRAE. Figure
1 shows an example of the first and fourth senses of the lexical entry columna
(column) in MiniDir-2.1. As MiniDir-2.1 has a low granularity, its senses
correspond, in general, to multiple senses in Spanish WordNet. For instance, we
can observe that sense 1 of columna corresponds to three Spanish WordNet
synsets (02326166n, 02326665n, and 02881716n).

Because MiniDir-2.1 is a lexical resource built with WSD in mind, it includes
additional information such as examples and collocations. Such information,
which is not present in the other sources, is potentially very useful for
performing word sense disambiguation.
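As an illustration of this entry structure, the following minimal sketch reads one entry in the plain-text layout of Figure 1 into a dictionary. The parser, the name `parse_entry`, and the abridged sample entry are assumptions made for illustration; they are not part of the MiniDir-2.1 distribution or of the tagging editor.

```python
import re

# Field names as they appear in Figure 1 (SOURCE precedes the nine lexical fields).
FIELD_NAMES = ["SOURCE", "LEMMA", "POS", "SENSE", "DEFINITION", "EXAMPLE",
               "SYNONYMS", "COLLOCATIONS", "SYNSETS", "DRAE"]
FIELD_RE = re.compile(r"\b(" + "|".join(FIELD_NAMES) + r"):")


def parse_entry(text):
    """Turn 'FIELD:value; FIELD:value; ...' into a dict (hypothetical helper)."""
    # Splitting on the field names (not on ';') keeps values such as EXAMPLE,
    # which may themselves contain semicolons, in one piece.
    parts = FIELD_RE.split(text)
    entry = {}
    for name, value in zip(parts[1::2], parts[2::2]):
        entry[name] = value.strip().rstrip(";").strip()
    # List-valued fields: collocations are comma-separated, synsets use '/'.
    entry["COLLOCATIONS"] = [c.strip() for c in
                             entry.get("COLLOCATIONS", "").split(",") if c.strip()]
    entry["SYNSETS"] = [s for s in entry.get("SYNSETS", "").split("/") if s]
    return entry


# Sense 4 of "columna", abridged from Figure 1.
ENTRY = ("SOURCE:MiniDir-2.1; LEMMA:columna; POS:ncmfs; SENSE:4; "
         "DEFINITION:forma cilindrica que toman algunos fluidos o gases; "
         "EXAMPLE:una densa columna de humo; SYNONYMS:?; "
         "COLLOCATIONS:columna_de_humo; SYNSETS:08508248n; DRAE:3/5")

entry = parse_entry(ENTRY)
print(entry["LEMMA"], entry["SENSE"], entry["SYNSETS"], entry["DRAE"])
```

The SYNSETS and DRAE fields are what link a MiniDir-2.1 sense to the other two sense repositories.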



⁴ The lexical category is represented by the Eagle tags (Eureka 1989-1995),
which have been abridged.

SOURCE:DRAE; LEMMA:columna; POS:ncmfs; SENSE:3; DEFINITION:forma que toman
algunos fluidos, en su movimiento ascendente. Columna de fuego, de humo; ...
SYNSETS:08508248n; MiniDir-2.1:4

SOURCE:DRAE; LEMMA:columna; POS:ncmfs; SENSE:5; DEFINITION:porción de fluido
contenido en un cilindro; ... SYNSETS:08508248n; MiniDir-2.1:4

Fig. 2. Two simplified DRAE lexical entries for the word columna

DRAE is a normative dictionary of the Spanish language which has not been
designed for the computational treatment of language or for word sense
disambiguation. Entries have been adapted to the format required by the
semantic tagging editor [12] used in the manual semantic tagging. DRAE also
presents a high level of granularity and of overlapping among definitions. Many
senses belong to specific domains, and outdated senses are also frequent.
Figure 2 contains an example of DRAE entries for senses 3 and 5 of the word
columna.

The third lexical source we have used is the Spanish WordNet lexical database.
It was developed within the framework of EuroWordNet [4] and includes
paradigmatic information (hyperonymy, hyponymy, synonymy, and meronymy). As is
well known, this lexical knowledge base is characterized by its fine
granularity and the overlapping of senses, which makes the annotation process
more difficult. Spanish WordNet was developed following a semiautomatic
methodology [4], which took the English version (WordNet 1.5) as reference.
Since there is no one-to-one correspondence between the senses of both
languages, some mismatches appeared in the mapping process. Although Spanish
WordNet has been checked many times, some mismatches remain, and this explains
the lack of some senses in Spanish and the excessive granularity of others.

2.2 The Tagging Process 

The annotated corpus used for evaluating the different lexical sources (DRAE,
MiniDir-2.1, and Spanish WordNet) is the subset of the MiniCors [13] corpus
corresponding to the ten selected words. MiniCors was compiled from the corpus
of the EFE Spanish News Agency, which includes 289,066 news items spanning from
January to December of 2000⁵, and it has been used as the source for the
Senseval-3 Spanish Lexical Sample task [14]. The MiniCors corpus contains a
minimum of 200 examples for each of the represented words. The context
considered for each word is larger than a sentence, as the previous and the
following sentences were also included. For each word, the goal was to collect
a minimum of 15 occurrences per sense from the available corpora, which was not
always possible. In the end, only the senses with a sufficient number of
examples were included in the final version of the corpus.

The tagging process was carried out by experienced lexicographers, who worked
individually so as to avoid interferences. Also, the authors of the dictionary
did not participate in the tagging process. In order to systematize and
simplify the annotation process to the utmost, a tagging handbook specifying
annotation criteria was designed in an initial phase [12], and a graphical
Perl-Tk interface was programmed in order to assist the tagging process. See
[14] for more details on the construction and annotation of the MiniCors
corpus.

⁵ The size of the complete EFE corpus is 2,814,291 sentences and 95,344,946
words, with an average of 33.8 words per sentence.

The 10-word subset of MiniCors treated in this paper has been annotated 
with the senses of DRAE and Spanish WordNet, in addition to the MiniDir-2.1 
original annotations. Again, each word has been annotated by three different 
expert lexicographers in order to facilitate the manual arbitration phase, which 
was reduced only to cases of disagreement. The annotators could assign more 
than one tag to the same occurrence in order to reflect more precisely the different 
agreement degrees. 

2.3 Evaluation and Arbitration 

Once the corpus has been tagged, we have carried out a comparative study 
among the different annotations and the subsequent evaluation of the results 
in order to obtain a disambiguated corpus to begin with the evaluation of the 
lexical sources. Since each word has been tagged three times for each lexical 
source, the subsequent process of arbitration has been reduced to those cases of 
disagreement among the three annotators. 

We distinguish four different situations of agreement/disagreement between
annotators: total agreement, partial agreement, minimum agreement, and
disagreement. Total agreement takes place when the three annotations completely
match (e.g.: 1, 1, 1 => 1). When not all the annotations match but there is an
individual sense assigned by all annotators, we get partial agreement (e.g.:
1, 1, 1/2 => 1; 1/2, 1/2, 1 => 1). Minimum agreement occurs when two
annotations match but the other one is different (e.g.: 1, 1, 2 => 1).
Finally, disagreement is produced when none of the annotations match. The
agreement cases, whether total, partial, or minimum, are validated
automatically according to the pattern we have previously defined. Only cases
of disagreement undergo a manual arbitration phase. We have also considered the
pairwise agreements between annotators for the analysis of results. The
Pairwise Agreement measure is the average of the agreement levels between each
pair of annotators. In this case, we distinguish between Minimum Pairwise
Agreement (cases of total agreement among every pair of annotators) and Maximum
Pairwise Agreement (cases of partial agreement among each pair of annotators).
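The four situations and the two pairwise measures can be sketched as follows. This is a minimal illustration and not the tool actually used: each annotation is modelled as a set of sense identifiers, "two annotations match" is read as "two annotations share a sense", and the strict and lenient pairwise counts are our reading of the Minimum and Maximum Pairwise Agreement measures.

```python
from itertools import combinations


def classify_agreement(annotations):
    """Classify three annotations (sets of senses); return (case, agreed senses)."""
    a, b, c = (frozenset(x) for x in annotations)
    if a == b == c:
        return "total", set(a)            # e.g. 1, 1, 1 => 1
    common = a & b & c
    if common:
        return "partial", set(common)     # e.g. 1, 1, 1/2 => 1
    for x, y in combinations((a, b, c), 2):
        if x & y:
            return "minimum", set(x & y)  # e.g. 1, 1, 2 => 1
    return "disagreement", set()          # left for manual arbitration


def pairwise_agreement(examples):
    """Return (strict, lenient) agreement averaged over all annotator pairs."""
    strict = lenient = pairs = 0
    for annotations in examples:
        for x, y in combinations([frozenset(s) for s in annotations], 2):
            pairs += 1
            strict += x == y        # the pair assigned exactly the same senses
            lenient += bool(x & y)  # the pair shares at least one sense
    return strict / pairs, lenient / pairs


print(classify_agreement([{1}, {1}, {1, 2}]))
print(pairwise_agreement([[{1}, {1}, {1, 2}]]))
```

Under this reading, the strict pairwise value can never exceed the lenient one, which is consistent with the ordering of the two pairwise rows in table 1.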

Table 1 shows the results obtained on each of the previous measures for each
sense repository and for each POS category. NumSenses is the average number of
senses assigned by the annotators.

2.4 Analysis of the Results 

The tagging experiments presented in table 1 show that the lexical source which
has been designed with specific criteria for WSD, MiniDir-2.1, reaches much
higher Total Agreement levels in the manual tagging of the corpus than Spanish
WordNet or DRAE, which represent lexical sources in common use. The worst
results have been obtained by Spanish WordNet, being slightly worse than those
of DRAE. We can also analyze the results along three related dimensions: the
disagreement measure, the degree of overlapping between senses, and the number
of senses per entry.

Table 1. Per POS category and global annotation agreements using Spanish
WordNet (SWN), MiniDir-2.1 (MD-2.1), and DRAE sources

              Nouns                     Adjectives
              SWN    MD-2.1  DRAE       SWN    MD-2.1  DRAE
TotAgr        0.64   0.83    0.57       0.15   0.67    0.24
PartAgr       0.12   0.03    0.18       0.42   0.06    0.51
MinAgr        0.20   0.14    0.23       0.33   0.26    0.23
DisAgr        0.04   0.00    0.02       0.10   0.01    0.02
MaxPairAgr    0.83   0.90    0.83       0.70   0.81    0.84
MinPairAgr    0.72   0.88    0.70       0.32   0.77    0.69
NumSenses     1.10   1.02    1.08       1.56   1.03    1.12

              Verbs                     Overall
              SWN    MD-2.1  DRAE       SWN    MD-2.1  DRAE
TotAgr        0.34   0.66    0.53       0.42   0.72    0.45
PartAgr       0.30   0.08    0.08       0.25   0.06    0.25
MinAgr        0.34   0.25    0.36       0.28   0.21    0.28
DisAgr        0.02   0.01    0.03       0.05   0.01    0.02
MaxPairAgr    0.78   0.83    0.74       0.77   0.85    0.80
MinPairAgr    0.47   0.76    0.67       0.50   0.80    0.69
NumSenses     1.53   1.03    1.05       1.39   1.03    1.08

Regarding the disagreement measure, Spanish WordNet has the highest score,
0.05, compared with 0.02 for DRAE and 0.01 for MiniDir-2.1. That means that the
arbitration phase for MiniDir-2.1 and DRAE has been done almost automatically,
whereas Spanish WordNet required more manual intervention. In Spanish WordNet
and DRAE we find a high level of overlapping between senses because these
dictionaries are very fine grained. These characteristics are reflected in the
high values of the Partial Agreement measure (compared to MiniDir-2.1) and in
the large differences between Maximum and Minimum Pairwise Agreement. This is
partially a consequence of the fact that the average number of senses assigned
to each example in Spanish WordNet, 1.39, is the highest, compared to 1.08 for
DRAE and 1.03 for MiniDir-2.1.

If we evaluate the results according to lexical categories, nouns achieve the
highest levels of agreement, probably because their referents are more stable
and clearly identifiable. As regards adjectives and verbs, the levels of
agreement are lower, especially in Spanish WordNet.

The annotation with MiniDir-2.1 reaches quite acceptable results (with an
overall agreement higher than 80% if we sum Total and Partial Agreement cases),
which support its adequacy for WSD tasks. Among the characteristics of
MiniDir-2.1 that could explain the better annotator agreement, we should point
out the fact that it contains both syntagmatic and co-occurrence information,
which are determining factors in helping annotators to decide on the correct
sense, as can be seen in the entries for columna presented in figure 1.



3 Automatic Disambiguation Experiments 

A supervised word sense disambiguation system based on Support Vector Machines
has been trained and tested using each of the three lexical resources. This
system is the core learning component of two systems that participated in the
Senseval-3 English All-words and Lexical Sample tasks, where they obtained very
competitive results [6, 15].

Support Vector Machines (SVM) is a learning algorithm for training linear
classifiers. Among all possible separating hyperplanes, SVM selects the one
that separates the positive examples from the negative ones with maximal
distance, i.e., the maximal margin hyperplane. By using kernel functions, SVMs
can also work efficiently in a high-dimensional feature space and learn
non-linear classification functions. In our WSD setting, we simply used a
linear separator, since some experiments with polynomial kernels did not
provide better results. We used the freely available SVMlight implementation by
Joachims [5] and a simple one-vs-all binarization scheme to deal with the
multiclass WSD classification problem.
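The one-vs-all binarization can be illustrated with the following minimal sketch. To keep it self-contained, a plain perceptron stands in for the linear SVM learner (an explicit substitution, not the SVMlight setup used in the experiments), and the feature vectors and sense labels are toy data invented for the example.

```python
def train_binary(X, y, epochs=20):
    """Perceptron used here as a stand-in for a linear SVM; y holds +1/-1."""
    w = [0.0] * (len(X[0]) + 1)  # last position is the bias term
    for _ in range(epochs):
        for x, t in zip(X, y):
            score = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            if t * score <= 0:   # misclassified (or on the boundary): update
                for i, xi in enumerate(x):
                    w[i] += t * xi
                w[-1] += t
    return w


def train_one_vs_all(X, labels):
    """One binary classifier per sense: that sense (+1) against all others (-1)."""
    return {sense: train_binary(X, [1 if l == sense else -1 for l in labels])
            for sense in sorted(set(labels))}


def predict(models, x):
    """Return the sense whose binary classifier gives the highest score."""
    def score(w):
        return w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return max(models, key=lambda sense: score(models[sense]))


# Toy data: three senses, one indicator feature each (hypothetical).
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
labels = ["sense_1", "sense_2", "sense_3"]
models = train_one_vs_all(X, labels)
print(predict(models, [0, 1, 0]))
```

With k senses for a word, this scheme trains k binary classifiers and, at prediction time, keeps the sense with the largest score.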

Regarding feature representation of the training examples, we used the Fea- 
ture Extraction module of the TALP team in the Senseval-3 English Lexical 
Sample task. The feature set includes the classic window-based pattern features 
extracted from a ±3-token local context and the “bag-of-words” type of features 
taken from a broader context. It also contains a set of features representing the 
syntactic relations involving the target word, and some semantic features of the 
surrounding words extracted from the Multilingual Central Repository of the 
Meaning project. See [15, 6] for more details about the learning algorithm and 
the feature engineering used. 
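A minimal sketch of the two simplest feature families mentioned above (positional features from a ±3-token window and bag-of-words features from the whole context) could look as follows. The feature naming scheme and the function `extract_features` are assumptions made for illustration; the syntactic and semantic features of the actual module are not reproduced here.

```python
def extract_features(tokens, target_index, window=3):
    """Return window-based and bag-of-words features for the target token."""
    features = set()
    # Local context: one positional feature per token within +/- `window`.
    for offset in range(-window, window + 1):
        position = target_index + offset
        if offset != 0 and 0 <= position < len(tokens):
            features.add(f"w[{offset:+d}]={tokens[position].lower()}")
    # Broader context: unordered bag of all other tokens in the example.
    for position, token in enumerate(tokens):
        if position != target_index:
            features.add(f"bow={token.lower()}")
    return features


tokens = "una densa columna de humo cubre la ciudad".split()
features = extract_features(tokens, tokens.index("columna"))
print(sorted(features))
```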

We have been working with a total of 1,637 examples, which are the examples in
the intersection of the three annotation sources. That means that some examples
had to be eliminated from the original Senseval-3 sets, since they could
not be assigned to any sense either in the DRAE or Spanish WordNet sense 
repositories. The training and test partitions have been obtained by randomly 
selecting 2/3 and 1/3 of the total number of examples, respectively. The total 
number of training examples is 1,094, while the number of test examples is 543. 
The number of observed senses for these 10 words (ambiguity rate) ranges from 3
to 13 depending on the lexical source. Note that, though DRAE and Spanish
WordNet are much more fine-grained than MiniDir-2.1, the difference in the
number of senses actually observed in the examples is not dramatic (7.9 and 7.8
versus 5.7). Moreover, the average numbers of senses according to DRAE and
Spanish WordNet are almost identical. See more information about the individual
words in table 2.

Table 2. Basic information about the 10 selected words for training and
evaluating the SVM-based WSD system

                    Number of senses          Examples
word         POS    DRAE   MD-2.1   SWN       #train   #test
apoyar       v        5      3        6         140      51
apuntar      v        8      9        7         124      54
ciego        a        8      5        7          83      49
columna      n        7      8        9         127      63
corazón      n        8      6        8         113      58
explotar     v        6      5        7         131      53
letra        n       10      5        7          92      63
natural      a        9      6       13          92      46
pasaje       n       11      4        7          87      53
volar        v        7      6        7         105      53
avg./total   -      7.9    5.7      7.8       1,094     543

The multiplicity of labels in examples (see the ‘NumSenses’ row in table 1) 
has been addressed in the following way. When training, the examples have been 
replicated, one for each sense label. When testing, we have considered a correct 
prediction whenever the proposed label is any of the example labels. 
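The treatment of multiple labels can be written down in a few lines. This is a minimal sketch with hypothetical names (`replicate_for_training`, `lenient_accuracy`) and toy data, not the code used in the experiments.

```python
def replicate_for_training(examples):
    """Turn (features, {labels}) pairs into one training instance per label."""
    return [(features, label)
            for features, labels in examples
            for label in sorted(labels)]


def lenient_accuracy(predictions, gold_label_sets):
    """A prediction counts as correct if it is any of the example's labels."""
    hits = sum(1 for predicted, gold in zip(predictions, gold_label_sets)
               if predicted in gold)
    return hits / len(gold_label_sets)


# Toy data: the second example carries two sense labels.
examples = [("context_1", {"1"}), ("context_2", {"1", "2"})]
training = replicate_for_training(examples)
print(training)
print(lenient_accuracy(["2", "3"], [{"1", "2"}, {"1"}]))
```

Note that an example with several labels offers more ways of being counted as correct, a point that matters when comparing sources with different NumSenses values.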

The overall and per-word accuracy results obtained are presented in table 3.
For each lexical source we also include the results of the baseline Most
Frequent Sense classifier (MFS). It can be seen that the MFS results are fairly
similar for all three annotation sources (from 46.78% to 47.88%), while the
SVM-based system clearly outperforms the MFS classifier in all three cases. The
best results are obtained when using the MiniDir-2.1 lexical source (70.90%),
followed by DRAE (67.22%) and Spanish WordNet (66.67%). The MiniDir-2.1
accuracy represents an increase of 24.12 percentage points over MFS and an
error reduction of 45.32%.
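The two figures for MiniDir-2.1 follow from the accuracies alone: the gain is the difference between the SVM and MFS accuracies, and the error reduction is that gain divided by the MFS error. A short check (the function name is ours):

```python
def error_reduction(baseline_acc, system_acc):
    """Relative error reduction in %, with both accuracies given in %."""
    return 100.0 * (system_acc - baseline_acc) / (100.0 - baseline_acc)


mfs, svm = 46.78, 70.90            # MiniDir-2.1 averages from table 3
gain = svm - mfs                   # absolute gain in percentage points
reduction = error_reduction(mfs, svm)
print(f"{gain:.2f} points, {reduction:.2f}% error reduction")
# prints "24.12 points, 45.32% error reduction"
```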



Table 3. WSD results (accuracy in %) using all three sense repositories: DRAE,
MD-2.1, and SWN. Columns 3, 5, and 7 contain the results of the MFS baseline
(most-frequent sense classifier). Columns 4, 6, and 8 contain the results of
the SVM-based system

                  DRAE              MD-2.1            SWN
word       POS    MFS     SVM       MFS     SVM       MFS     SVM
apoyar     v      92.16   92.16     88.24   84.31     80.39   68.63
apuntar    v      55.56   66.67     46.30   68.52     59.26   85.19
ciego      a      57.14   71.43     61.22   75.51     48.98   71.43
columna    n      22.22   74.60     20.63   79.37     38.10   74.60
corazón    n      37.93   58.62     43.10   67.24     46.55   65.52
explotar   v      43.40   50.94     43.40   69.81     41.51   64.15
letra      n      39.68   61.90     34.92   60.32     41.27   53.97
natural    a      58.70   73.91     47.83   65.22     34.78   50.00
pasaje     n      35.85   60.38     39.62   77.36     37.74   64.15
volar      v      47.17   64.15     52.83   62.26     41.51   67.92
average    -      47.88   67.22     46.78   70.90     46.78   66.67

Compared to the other lexical sources, the differences in favor of MiniDir-2.1
are statistically significant with a confidence level of 90% (using a z-test
for the difference of two proportions). The difference between MiniDir-2.1 and
Spanish WordNet is also significant at 95%. These results provide some
empirical evidence which complements that presented in the previous section.
Not only do human annotators achieve a higher agreement when using MiniDir, but
a supervised WSD system also obtains better results when using this source for
training.
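For readers who wish to run this kind of check themselves, a pooled two-proportion z statistic can be computed as sketched below. The counts of correct answers are back-computed from the percentages in table 3 and the 543 test examples; the pooled, unpaired form and the one-sided p-value are assumptions, since the exact variant of the test is not specified in the text, so the values obtained need not coincide with the significance levels reported above.

```python
import math


def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Pooled z statistic for the difference between two proportions."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    return (p_a - p_b) / std_err


def one_sided_p(z):
    """Upper-tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


N_TEST = 543
# Correct answers implied by 70.90%, 67.22%, and 66.67% of 543 examples.
CORRECT = {"MD-2.1": 385, "DRAE": 365, "SWN": 362}

for other in ("DRAE", "SWN"):
    z = two_proportion_z(CORRECT["MD-2.1"], N_TEST, CORRECT[other], N_TEST)
    print(f"MD-2.1 vs {other}: z = {z:.2f}, one-sided p = {one_sided_p(z):.3f}")
```

Because the three systems are evaluated on the same test examples, a paired test would also be a reasonable choice here.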

Nevertheless, the advantage could be due to the fact that MiniDir-2.1 (5.7
senses/word on average) is somewhat coarser grained than DRAE (7.9
senses/word) and WordNet (7.8) on the ten considered words. To compare the
lexical resources on a fairer basis, a new evaluation metric seems to be needed
that compensates for the difference in the number of senses. As a first
approach, we clustered together the senses from all lexical sources, following
the coarsest of the three (MiniDir-2.1). That is, each DRAE and Spanish WordNet
sense was mapped to a MiniDir-2.1 sense, and any answer inside the same cluster
was considered correct. This procedure required some manual work in the
generation of the mappings between lexical sources. Some ad-hoc decisions were
taken in order to correct inconsistencies induced by the more natural mappings
between the three sources.
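The cluster-based scoring can be sketched as follows. The mapping rows shown are the ones visible in figures 1 and 2 for columna (DRAE sense 1 to MiniDir-2.1 sense 1, DRAE senses 3 and 5 to MiniDir-2.1 sense 4); the function name and the toy predictions are assumptions made for illustration.

```python
# (source, sense) -> MiniDir-2.1 sense for "columna", as given in figures 1 and 2.
SENSE_TO_CLUSTER = {
    ("DRAE", "1"): "1",
    ("DRAE", "3"): "4",
    ("DRAE", "5"): "4",
}


def clustered_accuracy(predictions, gold_label_sets, source,
                       mapping=SENSE_TO_CLUSTER):
    """Count a prediction as correct if it falls in the cluster of a gold label."""
    correct = 0
    for predicted, gold in zip(predictions, gold_label_sets):
        gold_clusters = {mapping[(source, sense)] for sense in gold}
        if mapping[(source, predicted)] in gold_clusters:
            correct += 1
    return correct / len(predictions)


# Predicting DRAE sense 3 for an example tagged with DRAE sense 5 is now correct,
# because both senses belong to the MiniDir-2.1 cluster 4.
print(clustered_accuracy(["3", "1"], [{"5"}, {"3"}], "DRAE"))
```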

The evaluation according to the sense clusters led to some disappointing
results. The best overall accuracy results were obtained by DRAE (72.62%),
followed by Spanish WordNet (71.19%) and MiniDir-2.1 (70.48%). However, it is
worth noting that none of these differences is statistically significant (at a
confidence level of 90%). It remains to be studied whether this lack of actual
differences is due to the small number of examples used in our experiments, or
to the fact that the dictionary used does not really affect very much the
achievable performance of supervised machine learning WSD systems. The way in
which we addressed the problem of the multiple sense labels per example (see
table 1 and above) may tend to favor the evaluation of the most fine-grained
lexical sources (Spanish WordNet and DRAE), and may partly explain the lack of
differences observed. We think that the design of other evaluation measures,
independent of the number of senses and able to isolate the contribution of the
lexical sources, also deserves further investigation.



4 Conclusions 

In this study we have evaluated different lexical sources in order to determine
the most adequate one for WSD tasks. The evaluation has consisted of the
tagging of a single corpus with three different dictionaries and different
annotators. The degree of agreement among the annotators has been the
determining criterion for establishing the quality of the lexical source.

According to our experiments, MiniDir-2.1, the lexical source designed with
specific criteria for WSD, reaches much higher agreement levels (above 80%) in
the manual tagging of the corpus than Spanish WordNet or DRAE. The specific
features of MiniDir-2.1 that help to explain these differences are the
following: 1) MiniDir-2.1 is coarser grained than DRAE and Spanish WordNet,
avoiding to some extent the overlapping of senses; 2) it contains both
syntagmatic and co-occurrence information, which help the annotators to decide
on the correct senses.

The evaluation of an SVM-based WSD classifier, trained on the three different
lexical resources, seems to indicate that a reference dictionary with a higher
degree of agreement also produces better results for automatic disambiguation.

We also provide the results of a first attempt at evaluating the WSD systems
independently of the average number of senses per word, by means of a sense
mapping and clustering across lexical sources. Unfortunately, these results
showed no significant differences among lexical sources. Up to now, it remains
unclear whether the increase in performance produced by the use of a lexical
source specifically designed for WSD is mainly explained by the higher quality
of the lexical source or by the decrease in sense granularity. This is an issue
that requires further research, including experiments on bigger corpora to
produce statistically significant results and a careful design of the
evaluation metrics used.



Abstract. We propose a method to develop a meta-system for multilingual machine
translation by reusing free online automatic translation engines. This system
can process and translate homogeneous or heterogeneous (multilingual and
multicoding) documents. Its purpose is to identify the language(s) and the
coding(s) of the input text, to segment a heterogeneous text into several
homogeneous zones, to call the most suitable MT engine for the source and
target language pair, and to retrieve the translated results in the desired
language. The system can be used in several different applications, such as
multilingual search, translation of e-mails, construction of multilingual Web
sites, etc.



1 Introduction 

Currently, there are several free online MT engines, such as Systran,
WorldLingo, and Reverso, but these free versions limit the length of the input
text to some 50-150 words. These engines can only translate monolingual and
monocoding texts or Web pages, with a language pair determined in advance.

With the widespread use of the Internet, we can receive information written in
several different languages (e-mails, technical catalogues, notes, Web sites,
etc.), and the need to translate these texts into the mother tongues of the
users naturally arises. We can also receive information without knowing the
language in which the text is written.

Moreover, translation quality is a concern for users [3], [13]. To translate an
English document into French, one can choose either Systran or one of the other
engines (Reverso, WorldLingo, etc.). How do we choose among the existing MT
engines for our document?

Our principal idea is to construct a system that uses free online translation
engines and to add to it the necessary functions, such as the identification of
the language and the coding of the input text, the segmentation of a
heterogeneous text into homogeneous zones, the choice of the most suitable
translation engine for a language pair determined in advance, and the
parameterization of the calls to the translation engines.

We present here a method to develop a meta-system for multilingual machine
translation by reusing free online automatic translation systems [15]. In the
first part, we present the general architecture of our system for automatically
translating online documents. Next, we present an N-gram method for the
automatic identification of the language and coding of a text, and then the
method for segmenting a heterogeneous text into homogeneous zones. Furthermore,
we present a solution for choosing the most suitable MT engine for a source and
target language pair, based on the evaluation of the engines with the BLEU and
NIST methods. Finally, we present a method to parameterize the MT engines and
to retrieve the results. We also combine several MT engines and use English as
a pivot language to obtain a maximum number of language pairs. This system was
adapted to several different codings, such as BIG-5 (traditional Chinese),
GB-2312 (simplified Chinese), Shift-JIS (Japanese), EUC-KR (Korean), KOI-8,
CP-1251 (Russian), etc. We can apply this system in several different domains,
such as multilingual search, the translation of e-mails, the construction of
multilingual Web sites, etc.

J. L. Vicedo et al. (Eds.): EsTAL 2004, LNAI 3230, pp. 303-313, 2004.
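The combination of engine choice and English pivoting can be sketched as below. Everything in this example is hypothetical: the engine names, the language pairs, and the scores are invented placeholders standing in for BLEU or NIST results, and `plan_route` is not part of the system described here.

```python
PIVOT = "en"

# Hypothetical evaluation scores per language pair and engine (placeholders only).
SCORES = {
    ("fr", "en"): {"engine_a": 0.31, "engine_b": 0.27},
    ("en", "de"): {"engine_a": 0.22, "engine_b": 0.25},
    ("en", "fr"): {"engine_b": 0.30},
}


def best_engine(source, target, scores=SCORES):
    """Return the best-scoring engine for a language pair, or None."""
    candidates = scores.get((source, target))
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


def plan_route(source, target, scores=SCORES, pivot=PIVOT):
    """Return a list of (source, target, engine) steps, direct or via the pivot."""
    direct = best_engine(source, target, scores)
    if direct:
        return [(source, target, direct)]
    first = best_engine(source, pivot, scores)
    second = best_engine(pivot, target, scores)
    if first and second:
        return [(source, pivot, first), (pivot, target, second)]
    return []  # no direct pair and no route through the pivot


print(plan_route("fr", "en"))
print(plan_route("fr", "de"))
```

A pair that no engine covers directly is thus served by two calls, at the cost of accumulating the errors of both translation steps.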



2 Objectives 

Our first objective is to construct a tool that calls these translation engines
and retrieves their results. Parameters such as the language (source and
target), the coding, and the name of the translation engine can be used to
obtain different translations of the same input text. The second objective is
to use the translation engines to develop multilingual applications. The
engines can be integrated into multilingual systems to translate texts and
messages by executing our programs. We developed a Web site that translates
heterogeneous texts (containing multilingual and multicoding segments) into the
target language, even if the language of the input text is not known. This tool
makes it possible to translate any text into the desired language. If the text
is heterogeneous, the system segments it and identifies the language and the
coding of each segment. For example, we can copy and paste a text containing
several languages (English, Chinese, Japanese, etc.) and choose French as the
target language. We can also look up sites containing a given word (in
particular, a technical term): if we enter the word “segment” as a search
keyword on Google, the results may include Web sites in several languages, such
as French, English, and German, simply because they all happen to use this word
somewhere in their pages. We can use this tool to read all the results in the
same language. The tool can also be integrated into several different
applications, such as the translation of e-mails, the generation of a message
in several languages, the evaluation of the quality of MT engines, etc.



3 Free Online MT Engines 

In this section we present some current machine translation engines that can be
accessed online to translate texts.

3.1 Systran

SYSTRAN is currently a well-known translation engine. Its technology covers
translation solutions for the Internet, PCs, and network infrastructures, with
36 language pairs and 20 specialized domains. Systran has long been used by the
European Community and NASA and, more recently, by some important Internet
portals, such as AltaVista and Google. The online version can translate 34
language pairs. It can be accessed at http://www.systranbox.com/.

3.2 Gist-In-Time

This is a tool developed by Alis Technologies Inc. for the Web environment and
online companies. The users of Gist-In-Time benefit from the highest level of
comprehension currently available on the Internet. It provides 17 language
pairs (English <> French, English <> German, English <> Spanish, etc.). This
translation system can be accessed on the Gist-In-Time site:
http://www.teletranslator.com:8100/cgi-bin/transint.Fr.pl?AlisTargetHost=localhost.

3.3 Reverso

This tool can run in PC, Internet, and Intranet environments, either as a
stand-alone application (for example, for text processing) or as a translation
engine directly integrated into an application (for example, Word or Excel).
The address of the Reverso site is http://www.reverso.net/textonly/default.asp.

3.4 FreeTranslation 

FreeTranslation is a product of SDL International, which provides translation
and application localization services. The SDL translation server provides 15
language pairs on the site http://www.freetranslation.com/.

3.5 IBM Alphaworks 

Alphaworks is an online automatic translation system dedicated to the
translation of Web pages. The input data is the URL of the Web page, starting
with http://www. It can translate Web pages for 12 language pairs, and the
address of the Alphaworks translation site is
http://www.alphaworks.IBM.com/aw.nsf/html/m.



4 System Description 

4.1 Structure of the System 

We developed a tool to automatically translate texts (multilingual or
monolingual) using the existing online translation engines. The architecture of
the system is as follows:

Fig. 1. Meta-system of multilingual machine translation



The input text can be multilingual or multicoding (one can copy and paste input
data from several different Web sites, or the received texts can be
heterogeneous). We extract each paragraph, diagnose its language and coding,
and send it to the translation servers, since these servers often accept only
texts no longer than 150 words. If the paragraph is monolingual, we send it to
the servers immediately; otherwise, it must first be segmented into monolingual
zones, each one corresponding to one or more sentences.
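This dispatch step can be sketched as follows. The sketch is deliberately simplified and rests on several assumptions: `guess_language` is a toy stand-in based on a few function words (it does not handle codings and is not the N-gram identifier of the system), zones are found by recursively halving the list of sentences, and no translation server is actually called.

```python
import re

MAX_WORDS = 150  # upper limit mentioned in the text for the free engines

# Toy stand-in for the real identifier: a few function words per language.
STOPWORDS = {
    "en": {"the", "is", "and", "of", "in"},
    "fr": {"le", "la", "est", "et", "dans"},
}


def guess_language(sentence):
    """Return the language with most function-word hits, or None (hypothetical)."""
    words = set(re.findall(r"\w+", sentence.lower()))
    scores = {lang: len(words & vocab) for lang, vocab in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def split_sentences(text):
    """Split on whitespace that follows a punctuation mark."""
    return [s for s in re.split(r"(?<=[.!?;:])\s+", text.strip()) if s]


def merge(zones):
    """Join neighbouring zones that were assigned the same language."""
    merged = []
    for text, lang in zones:
        if merged and merged[-1][1] == lang:
            merged[-1] = (merged[-1][0] + " " + text, lang)
        else:
            merged.append((text, lang))
    return merged


def segment(sentences):
    """Halve a heterogeneous zone until every zone is monolingual."""
    langs = {guess_language(s) for s in sentences} - {None}
    if len(langs) <= 1 or len(sentences) == 1:
        return [(" ".join(sentences), next(iter(langs), None))]
    mid = len(sentences) // 2
    return merge(segment(sentences[:mid]) + segment(sentences[mid:]))


def limit_length(zone_text, max_words=MAX_WORDS):
    """Pack whole sentences into chunks of at most max_words words.

    A single sentence longer than the limit is still emitted as one chunk.
    """
    chunks, current = [], []
    for sentence in split_sentences(zone_text):
        if current and len(" ".join(current + [sentence]).split()) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


def prepare_requests(text):
    """Return (chunk, language) pairs ready to be sent to a translation server."""
    requests = []
    for paragraph in text.splitlines():
        sentences = split_sentences(paragraph)
        if not sentences:
            continue
        for zone_text, lang in segment(sentences):
            requests.extend((chunk, lang) for chunk in limit_length(zone_text))
    return requests


sample = "The cat is on the mat. Le chat est sur la table.\nThe dog is in the garden."
for chunk, lang in prepare_requests(sample):
    print(lang, "|", chunk)
```

Merging neighbouring zones after each split means that a cut falling in the middle of a monolingual stretch does no harm: the two halves are joined again once they are found to share a language.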

4.2 Diagnostic and Segmentation 

We developed a tool, SANDOH (System of Analysis of Heterogeneous Documents)
[14], which analyzes a heterogeneous text by using the N-gram method and a
progressive segmentation method that splits a heterogeneous text into
homogeneous zones. The result of the analysis is the pair <language, coding> if
the document is homogeneous; otherwise, it is the list of zones with the pair
<language, coding> used in each zone: {zone-1, language-1, coding-1}, {zone-2,
language-2, coding-2}, ..., {zone-n, language-n, coding-n}. A demonstration of
this tool is accessible online at
http://www-clips.imag.fr/geta/User/hung.vo-trung/id_langue/web_fr/index.htm.

To segment a text, we use a technique of progressive segmentation based on
punctuation marks. Normally, the parts written in different languages are
separated by punctuation marks such as the period, the dash, the colon, the
bracket, the quotation mark, the exclamation mark, the semicolon, etc. The idea
is that, after having evaluated a zone, if it is heterogeneous, we go on
splitting it into smaller zones and evaluating these zones in turn. However, we
must make sure that the zones do not become too short for the identification.

Initially, we evaluate each paragraph (the end of each paragraph is marked by
the End of Line sign, “EOL”) to detect whether it is homogeneous or
heterogeneous. If it is homogeneous, we go on to analyze the next paragraph; if
not, the paragraph must be segmented into two zones. It is enough to split the
text in the middle and then to test what happens when the split point is moved
by one word to the left or to the right, until each zone consists of whole
sentences. We continue to evaluate and split each zone into smaller zones until
we obtain homogeneous zones. The identification of the language and coding is
carried out as the diagram below shows [12]:

Fig. 2. General architecture of an identifier of language and coding

The training phase is based on a statistical model. At the beginning, one must have
files annotated with <language, coding> tags. The training module then creates the
"models" corresponding to each couple <language, coding>. These models are built from
the frequencies of the sequences counted in the training file. We can then merge all
these model files into a single file that contains all the language and coding
distribution models. This file is used to identify other texts later. The
identification phase determines in which language a text is written and with which
coding. It uses the same method, comparing segments of the text to be analyzed with the
sequences in the language models in order to score the text. Here is an example of a
diagnostic result (table of scores) for a French text, "les chiens et les chats sont
des animaux", and for a Chinese text, with the 10 highest scores:

Table 1. Scores of diagnostic result

French text                           Chinese text
Language-encoding      Score          Language-encoding      Score
French                 255.8934       Chinese-gb2312         344.3639
Catalan                236.9501       Chinese-big5           234.7359
English                231.5291       Korean                 222.3591
Breton                 211.8286       Arabic-windows1256     170.5808
Latin                  195.9060       Tamil                  169.7671
Slovenian-iso8859_2    195.1329       Arabic-iso8859_6       140.7884
Irish                  183.8261       Ukrainian-koi8_r       135.7941
Quechua                167.9354       Japanese-shift_jis     128.0966
Slovak-windows1250     167.8815       Thai                   119.0824
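The training and identification phases can be illustrated with a minimal sketch in Python. It is not the SANDOH implementation: the scoring formula (sum of the relative frequencies of the N-grams found in the text) is an assumption, the function names are hypothetical, and it works on decoded characters, whereas identifying a coding requires counting byte sequences.

```python
from collections import Counter


def ngrams(text, n=3):
    """Character N-grams of the lower-cased text, padded with one space each side."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3):
    """samples maps a (language, coding) label to its training text.

    Returns one model per label: the relative frequency of each N-gram.
    """
    models = {}
    for label, text in samples.items():
        counts = Counter(ngrams(text, n))
        total = sum(counts.values())
        models[label] = {gram: c / total for gram, c in counts.items()}
    return models


def identify(text, models, n=3):
    """Score every label against the text; the best-scoring label comes first."""
    grams = ngrams(text, n)
    scores = {label: sum(model.get(g, 0.0) for g in grams)
              for label, model in models.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With realistic training files, the sorted list returned by `identify` plays the role of the table of scores above.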
For a multilingual text, for example, “Life is rarely as we would like it to be rather
it is exactly as it is: C’est la vie!”, the result will be a text in the form:
<En-CP1252>Life is rarely as we would like it to be rather it is exactly as it is:
</En-CP1252><Fr-CP1252> C’est la vie!</Fr-CP1252>.

After diagnosing the language and coding, we might need to convert the text to the
coding system accepted by the translation server.
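The progressive segmentation described in this section can be sketched as follows. This is a simplified illustration in Python, not the authors' tool: the language guesser is a toy stand-in based on a few stop words (the real system uses N-gram models), the zone is cut at the punctuation mark nearest its middle rather than by moving the split point word by word, and the names and thresholds are hypothetical.

```python
import re

# Toy stand-in for the N-gram identifier: a few stop words per language.
STOPWORDS = {
    "en": {"the", "is", "as", "we", "it", "to", "be"},
    "fr": {"la", "le", "les", "est", "et", "des", "vie"},
}


def guess(zone):
    """Return (language, share of recognised words won by that language)."""
    words = re.findall(r"[^\W\d_]+", zone.lower())
    hits = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    total = sum(hits.values())
    best = max(hits, key=hits.get)
    return best, (hits[best] / total if total else 1.0)


def segment(zone, min_words=2, threshold=0.9):
    """Split a zone until each part looks homogeneous or is too short to split."""
    lang, share = guess(zone)
    cuts = [m.end() for m in re.finditer(r"[.:;!?]", zone) if m.end() < len(zone)]
    if share >= threshold or not cuts:
        return [(lang, zone.strip())]
    cut = min(cuts, key=lambda c: abs(c - len(zone) / 2))
    left, right = zone[:cut], zone[cut:]
    if min(len(left.split()), len(right.split())) < min_words:
        return [(lang, zone.strip())]  # too short to identify reliably
    return segment(left, min_words, threshold) + segment(right, min_words, threshold)
```

Applied to the example sentence above, the sketch yields one English zone ending at the colon and one French zone.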

4.3 Selection of the Translators 

We reuse two free online translation engines and use English as the pivot if no direct
translation engine exists for the source-target couple. Indeed, if a system such as
Systran, Gist-in-Time or Reverso provides its service for a language L, it always has
the pair (L, English). Of course, "double" translations are definitely of lower quality
than direct translations, but the result is generally still readable and usable. For
example, to translate a French text into Japanese, we can compose the two translators
French-English and English-Japanese (of the Systran translation server). In the same
way, to translate an Arabic text into French, we can compose the bilingual engines
Arabic-English (FreeLanguageTranslation) and English-French (Systran).
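The composition of translators through the English pivot can be sketched as below. The catalogue of engines is a hypothetical illustration built from the two examples in the text, and the function name is an assumption.

```python
# Hypothetical catalogue: (source, target) -> engines offering that pair.
ENGINES = {
    ("fr", "en"): ["Systran", "Reverso"],
    ("en", "fr"): ["Systran", "Reverso"],
    ("en", "ja"): ["Systran"],
    ("ar", "en"): ["FreeLanguageTranslation"],
}


def plan_route(source, target, engines=ENGINES, pivot="en"):
    """Return the hops (source, target, engine): one if direct, two via the pivot."""
    if (source, target) in engines:
        return [(source, target, engines[(source, target)][0])]
    first, second = (source, pivot), (pivot, target)
    if first in engines and second in engines:
        return [(*first, engines[first][0]), (*second, engines[second][0])]
    raise LookupError(f"no route from {source} to {target}")
```

Here the first engine listed for a pair is taken; in the system described, that choice is driven by the evaluation scores.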

In order to choose the best translation engine for a given source and target language
pair, we evaluated the translation quality of the engines for each language pair. We
then use the results of this evaluation as the basis for choosing an engine for the
language pair to be translated [5]. To evaluate the translation engines, we use two
well-known methods: BLEU and NIST [10]. BLEU is a technique for evaluating MT engines
that was introduced by IBM in July 2001 in Philadelphia [9]. The principal idea is to
compare the output of MT engines with expert reference translations in terms of the
statistics of short word sequences (word N-grams) [4]. It showed a strong correlation
between these automatically generated scores and human judgments of translation
quality. Evaluation based on N-gram co-occurrence statistics requires a corpus to be
evaluated and high-quality reference translations [8]. The IBM algorithm scores MT
quality in terms of the number of matched N-grams and also includes a comparison of the
length of the translations with that of the references. N-gram co-occurrence scoring is
typically carried out segment by segment, where a segment is the minimum unit of
translation agreement, usually one or several sentences. The N-gram co-occurrence
statistics, based on the sets of N-grams for the translation and reference segments,
are calculated for each of these segments and then accumulated over all the segments
[10], [16]. The NIST method is derived from the BLEU evaluation criterion but differs
in one fundamental aspect: instead of N-gram precision, the information gain of each
N-gram is taken into account [7]. The idea is to give more credit when a system matches
an N-gram that carries more information, that is, a rarer one. This also makes the
scores sensitive to proportional differences in co-occurrence for all N-grams.
Consequently, there is a potential for counterproductive variance due to low
co-occurrence counts for the larger values of N. An alternative is to use an arithmetic
mean of the N-gram counts.
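The N-gram co-occurrence idea behind BLEU can be made concrete with a minimal sketch. It handles one candidate and a single reference, with whitespace tokenisation and no smoothing; these are simplifying assumptions, and it is not the evaluation tool used for the scores reported here.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Multiset of the word N-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped N-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = word_ngrams(cand, n), word_ngrams(ref, n)
        # Each candidate N-gram is credited at most as often as it occurs
        # in the reference (clipping).
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # The brevity penalty compares the translation length with the reference length.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)
```

Because the precisions are combined by a geometric mean, a single order with no match drives the score to zero, which illustrates the sensitivity to low co-occurrence counts for large N mentioned above.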

We developed a tool to automatically evaluate the translation engines on the basis
of available corpora such as the Bible and BTEC [1].

The evaluation of the translation engines is carried out as shown in the following
diagram:

Fig. 3. Diagram of evaluation of the translation engines

Here is an example of the results of evaluating two MT engines, Systran and
Reverso, on the Bible corpus [11]:

Table 2. Scores of the evaluation of the online translation engines

Couple of languages    Systran              Reverso
                       BLEU      NIST       BLEU      NIST
Spanish-English        0.1322    3.5117     0.1257    3.3567
English-Spanish        0.0962    3.2985     -         -
French-English         0.1277    3.4968     0.1276    3.4010
English-French         0.1163    3.1208     0.0996    3.1349


Fig. 4. Comparison chart of the scores of NIST and BLEU (two panels, "Comparison of the
scores BLEU" and "Comparison of the scores NIST", by language pair: es_en, en_es, fr_en,
en_fr)

We can see that the translation quality of Systran is a little better than that of
Reverso on the evaluated language pairs, so we choose the Systran translator for the
language pairs Spanish-English and English-French.
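Choosing an engine from the evaluation results amounts to a table lookup, sketched below. The BLEU values are copied from Table 2; the dictionary layout and the function name are hypothetical, and using BLEU alone as the criterion is an assumption.

```python
# BLEU scores copied from Table 2; None marks a cell the table leaves empty.
BLEU_SCORES = {
    "es_en": {"Systran": 0.1322, "Reverso": 0.1257},
    "en_es": {"Systran": 0.0962, "Reverso": None},
    "fr_en": {"Systran": 0.1277, "Reverso": 0.1276},
    "en_fr": {"Systran": 0.1163, "Reverso": 0.0996},
}


def best_engine(pair, scores=BLEU_SCORES):
    """Return the engine with the highest available score for a language pair."""
    available = {engine: s for engine, s in scores[pair].items() if s is not None}
    return max(available, key=available.get)
```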

4.4 Sending the Requests 

After determining the appropriate MT server, we send a request to it. The result
obtained is a file in HTML format. This is carried out by the function get_doc(), which
we wrote in Perl. This function parameterizes the MT engines. To translate a text, the
program calls this function on each translation unit determined by the segmentation,
with parameters such as the URL of the translation Web site, the contents of the
segment to be translated, and the source and target language pair. For example, to
translate a text on the Systran server, we call get_doc() in the following way:

@res = get_doc("www.systranbox.com/systran/box?systran_text=$phrase[$j]&systran_lp=$ls_lc");

- www.systranbox.com/systran/box: URL of the Systran server

- systran_text = $phrase[$j]: contents of the segment to be translated

- systran_lp = $ls_lc: source language and target language (for example, $ls_lc =
"fr_en" to translate a French text into English).

The MT servers often accept different codings for the same language (for example,
EUC-JP or Shift-JIS for Japanese, ISO-8859-1 or Unicode for French). It is necessary to
convert the text into a coding accepted by the server before sending a request.
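How the coding conversion interacts with the request can be sketched in Python with the standard library. The parameter names systran_text and systran_lp are those of the call shown above; the function name build_request and the http:// prefix are assumptions, and the sketch only builds the URL without sending anything.

```python
from urllib.parse import urlencode


def build_request(base_url, text, lang_pair, coding="iso-8859-1"):
    """Compose the request URL, percent-encoding the text in the server's coding."""
    query = urlencode({"systran_text": text, "systran_lp": lang_pair},
                      encoding=coding)
    return f"http://{base_url}?{query}"
```

The same text yields different byte sequences under different codings, for example `d%E9j%E0` in ISO-8859-1 against `d%C3%A9j%C3%A0` in UTF-8 for "déjà", which is why the coding must match what the server expects.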

4.5 Treatment of the Results 

The result obtained after the server has processed a request is generally an HTML file.
We must process this file further to extract the actual translation result. For this,
we use HTML tags to locate the translated text. The result obtained is a string in the
target language. The coding applied to this text depends on the language: for example,
GB-2312 for simplified Chinese, Shift-JIS for Japanese, EUC-KR for Korean, CP1251 for
Russian, etc. To display this text correctly, we can choose one of the following
solutions: display the text in its existing coding system, or convert it into a
predetermined coding system. For display in the existing coding, we can set the coding
system of the Web page. For example, for Japanese we choose the Shift_JIS coding with
the instruction: "<meta http-equiv=Content-Type Content=text/html;
charset=shift_jis>" if ($lang eq "ja"). For display in a predetermined coding, we must
convert the text from its current coding to the predetermined coding system, using the
coding conversion function encode(ENCODING, $string). For example, to convert the
character string $result into UTF-8, we use the instruction:
$string_utf8 = encode("utf8", $result);
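The extraction and conversion steps can be sketched in Python with the standard library. The marker used to locate the translation (a div element with id "translation") is purely hypothetical, since each server lays out its result page differently; the class and function names are assumptions as well.

```python
from html.parser import HTMLParser


class TranslationExtractor(HTMLParser):
    """Collect the text inside <div id="translation"> (a hypothetical marker)."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "translation":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def extract_translation(raw_bytes, coding):
    """Decode the page from the server's coding and return the translated string."""
    parser = TranslationExtractor()
    parser.feed(raw_bytes.decode(coding))
    parser.close()
    return "".join(parser.parts).strip()


def to_utf8(raw_bytes, coding):
    """Convert the extracted translation to UTF-8 bytes for uniform display."""
    return extract_translation(raw_bytes, coding).encode("utf-8")
```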
5 Experimentation

We built a Web site that can automatically translate 11 languages, with 55 different
language pairs in simple translation. The text to be translated can be entered directly
in a textbox or opened from a file stored on the local disk. The length of the text for
translation is not limited. The text can be monolingual or multilingual. For example,
after searching for the word "segment" on Google, we take the results, which form a
multilingual text, use them as input to our Web site and choose the target language.
Here are two URLs from these results (one for a Web site in Russian, the other for a
Web site in English):

www.segment.ru/ База данных производителей и поставщиков канцелярских,...

www.segpub.com.au/ Segment Publishing is a Sydney, Australia-based company 
who specialise in standards-compliant Web development and FreeBSD-powered Web 
hosting... 

After we pass the two URLs to our system and the system translates these texts into
French (by calling other MT engines), the result is as follows:

www.segment.ru/ La base des données des producteurs et des fournisseurs du
bureau,...

www.segpub.com.au/ L'édition de segment est Sydney, compagnie Australie-basée
qui se spécialise dans le développement norme-conforme de site Web et l'accueil
FreeBSD-actionné de site Web.

We are now writing a program to process the format of the file before translating
it. This function will allow corpora in several different formats (RTF, HTML, XHTML,
XML...) to be preprocessed. The module will transform all these files into a single
format (a class of XML documents) for translation.

6 Applications 

We can apply this system in several different fields, such as multilingual information
retrieval, the translation of electronic mail, the construction of multilingual Web
sites, etc. For our part, we have applied this system to the internationalization of
applications, with the following three principal aspects:

Localization of the Software. We plan to apply this system at two levels. At the first
level, it can be used to translate messages, message files, menus, help files and
documents when localizing software (for example, to transcribe the message files of
ARIANE G5). At the second level, it can be used as an internal module in multilingual
systems, to translate execution messages directly from one language to another. Using
the translators at the second level will make the management of the user interface
easier. We will then be able to build multilingual software that includes a single
program code, message catalogues in only one language (for example, English), a
translator module and linguistic resources (for the local languages).

Integration in the Communication Systems on Internet. We also plan to use this tool in
electronic mail systems (to automatically translate received mails into the language of
the user), in online multilingual dialogue systems (dialogue in several languages), in
electronic commerce systems, etc.

Treatment of Multilingual Corpora. This tool can be used to produce corpora in
a new language (or at least to produce a first draft, as in the TraCorpEx project), to
evaluate corpora, etc.

7 Conclusion 

We have proposed a method for constructing a meta-system for multilingual machine
translation. This system integrates the existing online translation systems. Its
advantages are: no limitation on the length of the text to be translated, automatic
identification of the language of the text, automatic identification and conversion of
the coding of the text to suit the servers, the choice of the best translator among the
existing translation engines for a given language pair, an increase in the number of
language pairs available, and, especially, parameterization of the existing translation
engines so that they can easily be integrated into systems that need text translation.
In the near future, we will extend our system with an interface allowing the MT engines
available on the Web "to self-describe", with all the necessary information (language
pairs, codings, formats, etc.). We will also extend the function get_doc() so that it
can call local MT engines (installed on a local machine or local server), because they
are faster and more stable to access.
Abstract. This paper presents a novel approach to the development of
anaphoric annotation of large corpora based on the use of semantic infor- 
mation to help the annotation process. The anaphora annotation scheme 
has been developed from a multilingual point of view in order to anno- 
tate three corpora: one for Catalan, one for Basque and one for Spanish. 
An anaphora resolution system based on restrictions and preferences has 
been used to aid the manual annotation process. Together with morpho- 
syntactic information, the system exploits the semantic relation between 
the anaphora and its antecedent. 



1 Introduction 

Over the last decades, anaphora and coreference resolution has been one of the
most prolific research areas in Natural Language Processing. Nevertheless, it
seems that this research is at something of an impasse, probably due to the lack of
large manually annotated corpora. According to [1], anaphora annotation
is an “indispensable, albeit time-consuming, preliminary to anaphora resolution,
since the data they provide are critical to the development, optimization and
evaluation of new approaches”. For the development of robust anaphora resolution
systems, it is necessary to build large corpora annotated with anaphoric
units and their antecedents, which are essential for system training and evaluation.

In this paper we will present a manual anaphora annotation process based on 
the use of semantic information: the process is aided with an anaphora resolution 
system that finds each anaphora and suggests its possible antecedent. One of 
the special features of this system is the use of enriched syntactic and semantic 
information. 

The method has been developed within the 3LB project¹. The objective
of this project is to develop three large annotated corpora: one for Catalan
(Cat3LB), one for Basque (Eus3LB) and one for Spanish (Cast3LB). These corpora
are annotated at four linguistic levels: morphological, syntactic (complete),
semantic and discourse. At the discourse level, anaphora and coreferential chains are
annotated. Because the corpus (and specifically the Spanish corpus Cast3LB) has
previously been annotated with syntactic and semantic information, it is possible
to exploit this information in the anaphoric annotation phase (in fact, the
proposed AR method makes use of these kinds of linguistic information).

¹ Project partially funded by Spanish Government FIT-150-500-2002-244.

In order to test the usability of this method in the manual annotation process,
an experiment on the correctness of the system is necessary, together with a
measure of annotation agreement. These experiments are presented in this paper
in order to quantify the actual improvement.

The next section presents the anaphora resolution method enriched with semantic
information. Then, the corpus annotation scheme is presented. The paper
concludes with the evaluation data regarding the system’s accuracy and the
annotation agreement.



2 Semantic-Aided Anaphora Resolution 

Anaphora resolution (AR) has occupied linguists and computer scientists over the
last two decades. This task, considered one of the most important in the
treatment of ambiguity in Natural Language Processing, has been tackled from
different points of view and by a wide variety of systems.

It is difficult to make a universal classification of AR methods, because many
of them have developed combined strategies to improve their results. However,
bearing in mind that this paper aims to show the relevance of semantic information
in anaphora resolution, we can make the following classification:

— Limited knowledge methods: approaches that solve the anaphora using just
morphological and syntactic information. These are the most prolific methods
due to their low computational requirements. Since the first methods [2, 3],
many authors have proposed different approaches [4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14] that have demonstrated the high resolution level that can be reached
by applying only basic morphological and syntactic knowledge, in most cases
to restricted-domain corpora.

— Enriched methods: strategies that add, to the previous ones, additional
sources such as semantic information (based on semantic tagging or on the use of
ontologies) or pragmatic information (through discourse analysis or world knowledge).
Although limited knowledge methods provide good computational results
in AR, their own creators [3] accept the general idea of improving the results
with enriched knowledge. Although initial methods [15, 16] were tested on
small data sets (mostly manually treated), due to the absence of
big enough resources, the birth and development of lexical resources like
WordNet [17] or EuroWordNet [18] and ontologies such as Mikrokosmos [19]
provide new perspectives for this kind of method [20, 21, 22, 23]. Discourse-based
theories like centering [24] have inspired several authors of
AR methods [25, 12].

— A third group integrates those methods not included in the
previous ones. They use techniques based on statistics [26, 27, 13] or artificial
intelligence models [28, 11].

We would like to remark that most of the methods reviewed above deal
with the AR problem for English, but none of them deals with the problem using
enriched strategies for Spanish.

Furthermore, as mentioned before, one of the most important problems in the
improvement of AR methods is the lack of resources that integrate enough
linguistic information to support enriched approaches.

2.1 ERA Method 

The Enriched Resolution of Anaphora (ERA) method [29] belongs to the
group of enriched methods. It is the result of adding new information sources
to a revision of a classical AR method based on restrictions and preferences [14].
These new information sources come, on the one hand, from the enrichment of
the syntactic analysis of the text and, on the other hand, from the use of semantic
information, both applied to pronoun resolution (personal, omitted, reflexive
and demonstrative).

The enrichment of the syntactic analysis is based on an additional set of
labels that mark syntactic roles. These labels make it possible to redefine the
original restrictions and to avoid guesses based on syntactic roles that sometimes
fail due to the free word order of Spanish².

Semantic information is added through a set of labels that indicate the
correct sense of each word in the text. These correct senses have been selected
using Spanish WordNet synset identifiers [30, 18]. The method elaborates the
semantic information using two techniques:

— Ontological concepts of anaphoric candidates are related to the verb of the
anaphora. The semantic features of the lexical words have been extracted from
the ontological concepts of EuroWordNet, that is, the Top Ontology³. With this,
and the enriched syntactic information, subject-verb, verb-direct object and
verb-indirect object semantic patterns are extracted. In this way, a set of
semantic (or ontological) patterns gives a measure of semantic compatibility for
the preference phase in order to score the candidates in the resolution process.

— Two sets of semantic compatibility rules are defined:



² The original method which is enriched in ERA is based only on basic syntactic
information and is able to use only the relative position of a noun and a verb, supposing
that this position reveals the syntactic role of the former regarding the latter.

³ All the synsets in EuroWordNet are semantically described through a set of base
concepts (the more general concepts). In EuroWordNet’s Top Ontology, these base
concepts are classified in the three orders of Lyons [31], according to basic semantic
distinctions. So through the Top Ontology, all the synsets of EuroWordNet are
semantically described with concepts like “human”, “animal”, “artifact”, . . . However,
[32] reports some inconsistencies in the process of inheriting semantic properties from
the WordNet hierarchy. At this stage of the project, these inconsistencies have not been
taken into account.

• “NO” rules: NO(v#sense,c,r) defines the incompatibility between the
verb v (and its sense) and any noun which contains ’c’ in its ontological
concept list, where ’r’ is the syntactic function that relates them.

• “MUST” rules: MUST(v#sense,c,r) defines the incompatibility between
the verb v (and its sense) and all the nouns that do not contain ’c’ in
their ontological concept list, where ’r’ is the syntactic function that relates
them.

These rules are applied in the restriction phase in order to delete incompatible
candidates.
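The effect of the NO and MUST rules on a list of candidates can be sketched as follows. This is an illustration in Python, not the ERA implementation: the data structures, the verb sense identifier "eat#1" and the concept names are hypothetical.

```python
from collections import namedtuple

# kind is "NO" or "MUST"; role is the syntactic function relating verb and noun.
Rule = namedtuple("Rule", "kind verb_sense concept role")
Candidate = namedtuple("Candidate", "noun concepts")


def violates(rule, verb_sense, role, candidate):
    """True if the rule makes this candidate incompatible with the verb."""
    if (rule.verb_sense, rule.role) != (verb_sense, role):
        return False  # the rule concerns another verb sense or function
    has_concept = rule.concept in candidate.concepts
    # NO forbids nouns that carry the concept; MUST forbids nouns that lack it.
    return has_concept if rule.kind == "NO" else not has_concept


def filter_candidates(candidates, rules, verb_sense, role):
    """Delete the candidates rejected by at least one rule."""
    return [c for c in candidates
            if not any(violates(r, verb_sense, role, c) for r in rules)]
```

For instance, with a rule MUST(eat#1, Animal, subject), a candidate described only as an artifact is removed before the preference phase.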

Therefore, the specific use of semantic information is related to the semantic
compatibility (or incompatibility) between the possible antecedent (a noun) and
the verb of the sentence in which the anaphoric pronoun appears. Because the
pronoun replaces a lexical word (the antecedent), the semantic information of
the antecedent must be compatible with the semantic restrictions of the verb.
In other words, the anaphoric expression takes the semantic features of the
antecedent, so they must be compatible with the semantic restrictions of the
verb. In this way, verbs like “eat” or “drink” should be more compatible
with animal subjects and with edible and drinkable objects than with others.

The ERA method first applies a set of restrictions based on morphological,
syntactic and semantic information in order to reject all the candidates clearly
incompatible with the pronoun. Restrictions in ERA mix classical morphological
and syntactic information with semantic information in order to state the following
set of restrictions:

— Morpho-semantic restrictions: based on agreement and on specific combinations
of ontological concepts.

— Syntactic-semantic restrictions: based on the syntactic roles of the candidates
related to the anaphoric verb and on their ontological features.

— Syntactic restrictions: based on the classical positional restrictions but
enriched with the specific syntactic roles. These restrictions vary with the type
of pronoun to be resolved.

— Semantic restrictions: based on the semantic rules previously defined (NO
and MUST).

Once the incompatible candidates have been rejected, ERA applies a set of
preferences to score each candidate in order to select the best-scored
one as the correct antecedent of the anaphora. These preferences are based on
morphological, syntactic, semantic and structural criteria and vary
among the different types of pronouns treated. If the preference phase does not
give a unique candidate as antecedent, a final common preference set is applied
in order to decide (including the selection, in case of a tie, of the candidate
closest to the pronoun).

Figure 1 shows the application algorithm of the Enriched Resolution of Anaphora
(ERA) method defined above.

For each sentence S
    L = L + Store NPs with their enrichment data
    Compatibility patterns acquisition with the NPs in L
    For each pronoun P in S
        Identify pronoun P type
        Restriction application to L according to pronoun P type
            L' = Application of morpho-semantic restrictions to L
            L' = Application of syntactic-semantic restrictions to L'
            L' = Application of syntactic restrictions to L'
            L' = Application of incompatibility rules to L'
        If |L'| = 0 then P is not anaphoric
        If |L'| = 1 then L'[1] is the antecedent of P
        If |L'| > 1 then
            Preference application to L' according to pronoun P type
                L' = Application of structural and semantic-structural preferences to L'
                L' = Application of morphological preferences to L'
                L' = Application of syntactic preferences to L'
                L' = Application of semantic preferences to L'
                L' = Best(L')
            If |L'| = 1 then L'[1] is the antecedent of P
            If |L'| > 1 then
                L' = Application of common preferences
                Best(L') is the antecedent of P
            endIf
        endIf
    endFor
endFor

Fig. 1. ERA method algorithm application
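The control flow of Fig. 1 for a single pronoun can be sketched in Python as below. This shows only the restriction-then-preference skeleton under stated assumptions: restrictions are modelled as boolean functions, preferences as functions returning a numeric score, candidates are assumed to be ordered from closest to farthest from the pronoun, and all names are hypothetical.

```python
def best(pronoun, candidates, preferences):
    """Keep the candidates that reach the highest total preference score."""
    scores = [sum(pref(pronoun, c) for pref in preferences) for c in candidates]
    top = max(scores)
    return [c for c, s in zip(candidates, scores) if s == top]


def resolve(pronoun, candidates, restrictions, preferences, common_preferences):
    """Return the selected antecedent, or None if the pronoun is not anaphoric."""
    remaining = list(candidates)
    # Restriction phase: each restriction filters the output of the previous one.
    for keep in restrictions:
        remaining = [c for c in remaining if keep(pronoun, c)]
    if not remaining:
        return None
    if len(remaining) == 1:
        return remaining[0]
    # Preference phase, then the common preferences if several candidates remain.
    remaining = best(pronoun, remaining, preferences)
    if len(remaining) > 1:
        remaining = best(pronoun, remaining, common_preferences)
    # Final tie-break: the first remaining candidate, i.e. the closest one
    # under the ordering assumption stated above.
    return remaining[0]
```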



As follows from the above, the application of this method requires a corpus
tagged not only with basic morphological and shallow syntactic information
but also enriched with complete syntactic analysis and sense disambiguation.
Thanks to the 3LB project (described below), this resource will soon be
a reality and already allows basic evaluations of this kind of enriched
corpus-based approach. The 3LB annotation schema is described in the next
section.



3 Annotation Schema: 3LB Project 

The Cast3LB project is part of the general 3LB project⁴. As we said before, the main
objective of this general project is to develop three corpora annotated with
syntactic, semantic and pragmatic/coreferential information: one for Catalan
(Cat3LB), one for Basque (Eus3LB) and one for Spanish (Cast3LB).

The Spanish corpus Cast3LB is a part of the CLIC-TALP corpus, which is
made up of 100,000 words from LexEsp [33] plus 25,000 words coming from
the EFE Spanish Corpus, given by the Agencia EFE (the official news agency)
for research purposes. The EFE corpus is a fragment comparable to the other
corpora involved in the general project (Catalan and Basque).

⁴ Project partially funded by Spanish Government FIT-150-500-2002-244.

3.1 Morpho-Syntactic Annotation 

At the morphological level, this corpus was automatically annotated and manually
checked in previous projects [34]. At the syntactic level, the corpus has been
annotated following the constituency annotation scheme. The main principles of
syntactic annotation are the following [35]:

— only the explicit elements are annotated (except for elliptical subjects); 

— the surface word order of the elements is not altered; 

— no specific theoretical framework is followed;

— the verbal phrase is not taken into account, rather, the main constituents of 
the sentence become the daughters of the root node; 

— the syntactic information is enriched by the functional information of the 
main phrases, but we have not taken into account the possibility of double 
functions. 

3.2 Semantic Annotation 

At the semantic level, the correct sense of nouns, verbs and adjectives has been
annotated following an all-words approach. The specific sense (or senses) of each
word is indicated by means of the EuroWordNet offset number [18], that is, the
identification number of the sense (synset) in the InterLingua Index of EuroWordNet.
The corpus has 42291 lexical words, where 20461 are nouns, 13471 are verbs and
8543 are adjectives. Also, because some words are not available in EuroWordNet
or do not have a suitable sense, we have created two new tags to mark these
special cases.

Our proposal is based on the SemCor corpus [36], which consists of
approximately 250,000 words. All nouns, verbs, adjectives and adverbs have been
annotated manually with WordNet senses [36].

We have decided to use Spanish WordNet for several reasons. First of all, 
Spanish WordNet is, up to now, the most commonly used lexical resource in
Word Sense Disambiguation tasks. Secondly, it is one of the most complete lexical
resources currently available for Spanish. Finally, as part of EuroWordNet, the 
lexical structure of Spanish and the lexical structure of Catalan and Basque are 
related. Therefore, the annotated senses of the three corpora of 3LB project are 
related too. 

We have followed a transversal (or “lexical”) semantic annotation method
[37]. In this method, the human annotator proceeds word type by word type,
marking all the occurrences of each word in the corpus one by one. With this
method, the annotator must read and analyze all the senses of a word only once.

The main advantage of this method is that the annotator can focus on the sense
structure of one word and deal with its specific semantic problems: its main
sense or senses, its specific senses, etc. The annotator then checks the context
of the word each time it appears and selects the corresponding sense.
Through this approach, the semantic features of each word are taken into
consideration only once, and the whole corpus achieves greater consistency.

For the semantic annotation process, a Semantic Annotation Tool (3LB-SAT) 
has been developed [38]. The main features of this tool are: 

— it is word-oriented,

— it allows different formats for the input corpus; basically, the main formats
used in corpus annotation: treebank format (TBF) and XML format;

— it uses EuroWordNet as a lexical resource. 

In the annotation process, monosemic words are annotated automatically. The
tool itself is therefore used to annotate polysemic words, and to check whether
monosemic words lack a suitable sense.
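The automatic treatment of monosemic words can be illustrated with a minimal sketch. The lexicon below is a hypothetical stand-in for EuroWordNet: the sense identifiers are invented for illustration, and the function name is an assumption, not part of 3LB-SAT.

```python
from typing import Dict, List, Optional

# Hypothetical sense inventory: lemma -> list of sense identifiers.
# In the real setting these would be EuroWordNet offsets; the values here are invented.
LEXICON: Dict[str, List[str]] = {
    "oxígeno": ["00000001-n"],
    "banco": ["00000002-n", "00000003-n", "00000004-n"],
}


def pre_annotate(lemmas: List[str], lexicon: Dict[str, List[str]]) -> List[Optional[str]]:
    """Assign a sense automatically only when a lemma has exactly one sense.

    Polysemic or unknown lemmas receive None and are left to the human annotator.
    """
    annotations: List[Optional[str]] = []
    for lemma in lemmas:
        senses = lexicon.get(lemma, [])
        annotations.append(senses[0] if len(senses) == 1 else None)
    return annotations
```

Under these assumptions, only the lemmas that come back as None need to be shown to the annotator, which is where the word-oriented interface is actually used.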

3.3 Discourse Annotation: Anaphora and Coreference 

At the discourse level, the coreference of nominal phrases and some elliptical
elements is marked. The coreferential expressions taken into account are personal
pronouns, clitic pronouns, elliptical subjects and some elliptical nominal heads
(with an adjective as explicit element®). The possible antecedents considered
are nominal phrases or other coreferential expressions.

Specifically, in each kind of anaphoric expression, we mark: 

— Anaphoric ellipsis: 

• The elliptical subject, made explicit in the syntactic annotation step.
Being a noun phrase, it can also be an antecedent.

Unlike English, where an expletive pronoun can appear as subject, in
Spanish an elliptical nominal phrase is very common as subject of the
sentence. This is why we have decided to include this kind of anaphora in
the annotation process.

• Elliptical head of nominal phrases with an adjective complement. In
English, this construction is the “one” anaphora. In Spanish, however,
the anaphoric construction is made up of an elliptical head noun and
an adjective complement.

— Anaphoric pronouns:

• The tonic personal pronouns in the third person. They can appear in 
subject function or in object function. 

• The atonic pronouns, specifically the clitic pronouns that appear in the 
subcategorization frame of the main verb. 

— Finally, there are sets of anaphoric and elliptical units that corefer to the 
same entity. These units form coreferential chains. They must be marked in 
order to show the cohesion and coherence of the text. They are annotated 
by means of the identification of the same antecedent. 



® This kind of anaphora corresponds with the English “one” anaphora. 







Definite descriptions are not annotated in this project. They consist of nominal
phrases that may or may not refer to an antecedent. We do not mark them
because they pose specific problems that make this task very difficult: firstly,
there are no clear criteria that allow us to distinguish between coreferential and
non-coreferential nominal phrases; secondly, there is no clear typology of
definite descriptions; and finally, there is no clear typology of the relationships
between definite descriptions and their antecedents. These problems could
make the annotation process even more time-consuming and widen the
gap of disagreement between the human annotators.

This annotation scheme is based on the one used in the MUC
(Message Understanding Conference) [39] as well as in the works of [40] and [41];
this is the most widely used scheme in coreferential annotation [1].

In the coreference annotation, two linguistic elements must be marked: the 
coreferential expression and its antecedent. In the antecedent we annotate the 
following information: 

— A reference tag that shows the presence of an antecedent (“REF”), 

— An identification number (“ID”), 

— The minimum continuous substring that could be considered correct (“MIN”).

In the coreferential expression, we annotate:

— The presence of a coreferential expression (“COREF”), 

— An identification number (“ID”), 

— The type of coreferential expression: elliptical noun phrase, coreferential ad- 
jective, tonic pronoun or atonic pronoun (“TYPE”), 

— The antecedent, through its identification number (“REF”), 

— Finally, a status tag where the annotators show their confidence in the
annotation (“STATUS”).
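The two groups of attributes listed above can be pictured as two small record types. The following is a minimal sketch only: the class and field names are assumptions chosen for illustration, the storage format of 3LB is not shown here, and the set of STATUS values is assumed to be the three certainty degrees offered by the annotation tool (standby, certain, uncertain).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Types named in the scheme ("TYPE") and assumed confidence values ("STATUS").
COREF_TYPES = {"elliptical noun phrase", "coreferential adjective",
               "tonic pronoun", "atonic pronoun"}
STATUS_VALUES = {"standby", "certain", "uncertain"}


@dataclass
class Antecedent:
    ident: int        # "ID"
    min_span: str     # "MIN": minimum continuous substring considered correct
    ref: bool = True  # "REF": marks the presence of an antecedent


@dataclass
class CoreferentialExpression:
    ident: int         # "ID"
    coref_type: str    # "TYPE"
    ref: int           # "REF": identification number of the antecedent
    status: str        # "STATUS"
    coref: bool = True  # "COREF": marks the presence of a coreferential expression

    def __post_init__(self) -> None:
        if self.coref_type not in COREF_TYPES:
            raise ValueError(f"unknown TYPE: {self.coref_type}")
        if self.status not in STATUS_VALUES:
            raise ValueError(f"unknown STATUS: {self.status}")


def coreference_chains(expressions: Iterable[CoreferentialExpression]) -> Dict[int, List[int]]:
    """Group expressions by the antecedent they point to (one chain per antecedent ID)."""
    chains: Dict[int, List[int]] = {}
    for expression in expressions:
        chains.setdefault(expression.ref, []).append(expression.ident)
    return chains
```

Grouping by the shared REF value mirrors the idea that coreferential chains are annotated by identifying the same antecedent.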

For the anaphoric and coreferential annotation, a Reference Annotation Tool
(3LB-RAT) has been developed. This tool offers the annotator three working
modes: manual, semiautomatic and automatic. In the first one, the tool locates
and shows all possible anaphoric and coreferential elements and their possible
antecedents. The annotator chooses one of these possible antecedents and indicates
the degree of certainty of this selection (standby, certain or uncertain).

There are some exceptional cases that the tool always offers: 

— cases of cataphora

— possible syntactic mistakes (that will be used to review and to correct the 
syntactic annotation) 

— the possibility of a non-located antecedent 

— the possibility that an antecedent does not appear explicitly in the text

— the possibility of non-anaphora, that is, the application has not correctly
located an anaphoric expression

In the semiautomatic mode, the tool solves each coreference by means of the
enriched anaphora resolution method previously explained. The system thus
proposes and shows the most suitable candidate to the annotator. The annotator
can either accept the solution that the resolution method offers or choose
another solution manually.

The automatic mode does not give the annotator the chance to select any
option and simply solves all the pronouns according to the system’s parameters.

As we said before, the tool uses syntactic, morphological and semantic
information to specify an anaphoric expression and its antecedent. The semantic
information used by the tool is limited to ontology concepts and synonyms.
All these data have been indexed. From the semantically annotated text, three
tables are created, one for each syntactic function: subject, direct object and
indirect object. These tables record how frequently words appear with verbs
(with their correct senses). The tables are the basis for constructing
the semantic compatibility patterns, which indicate the compatibility between
the ontological concept related to the possible antecedent and the verb of the
sentence where the anaphoric expression appears. In order to calculate this
information, the occurrence frequency and the degree of conceptual generality
in the ontology are considered. In this case, a higher score is given to the most
concrete concepts.

For example, the “Human” concept gives us more information than the “Natural”
concept. These patterns are used when applying semantic preferences. For a
specific candidate, its semantic compatibility is calculated from the compatible
ontological concepts in the patterns. The candidates with greater compatibility
are preferred.
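The construction and use of the compatibility patterns can be sketched as follows. The exact scoring formula is not given in the text, so the combination used here (frequency multiplied by a specificity weight) is an assumption, as are the function names, the verb sense label and the weights assigned to “Human” and “Natural”.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# One observation: (syntactic function, verb sense, ontological concepts of the argument).
Observation = Tuple[str, str, List[str]]
FUNCTIONS = ("subject", "direct_object", "indirect_object")


def build_tables(observations: Iterable[Observation]) -> Dict[str, Counter]:
    """Build one frequency table per syntactic function, keyed by (verb sense, concept)."""
    tables: Dict[str, Counter] = {function: Counter() for function in FUNCTIONS}
    for function, verb_sense, concepts in observations:
        for concept in concepts:
            tables[function][(verb_sense, concept)] += 1
    return tables


def compatibility(tables: Dict[str, Counter], function: str, verb_sense: str,
                  concepts: List[str], specificity: Dict[str, float]) -> float:
    """Score a candidate: frequent and more concrete concepts contribute more (assumed formula)."""
    return sum(tables[function][(verb_sense, concept)] * specificity.get(concept, 1.0)
               for concept in concepts)
```

With such tables, candidates for the same anaphoric expression can be ranked by their score, so that a candidate carrying a concrete concept often seen with the verb is preferred over one carrying only a very general concept.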



4 Enriched Anaphora Resolution System in Annotation 
Process. Evaluation Data 

According to the specifications described in section 2.1 for the application
of the enriched anaphora resolution method, specific information provided by
the corpus annotation scheme is required. Because the corpus has been annotated
with syntactic information, and the sense of each word is marked with its
EuroWordNet offset number, it is possible to extract semantic features of each
verb and noun through the ontological concepts of the EuroWordNet Top Ontology.
Furthermore, the corpus has been annotated with syntactic roles, so it is
possible to extract syntactic patterns formed by the verb and its main
complements: subject-verb, verb-direct object, verb-indirect object.

What we aim to test is whether the interactive AR process improves on purely
manual annotation, making it faster, less tedious, and more consistent through
better annotation agreement.

The speed of the process lies in the fact that the annotator does not need to
locate the anaphoric element (a pronoun or an elliptical element in this case)
and go back in the text to search for all possible antecedent candidates. The
system makes the process simpler, faster and less tedious by automatically
detecting the anaphora and collecting the candidates in a single list
based on the search space criteria.
Furthermore, as has been said, the system provides the annotator with a
possible solution, that is, a suggestion of the correct antecedent. This
methodology can contribute to the consistency of the annotation decisions,
turning a completely free process into a guided one.

In order to test whether the AR system (enriched with semantic information) is
really useful, we have selected the passage with the highest number of different
anaphoric elements. The passage has been marked by two annotators using
the semantic-aided AR method (through the 3LB-RAT tool), and we have calculated,
on the one hand, the agreement between the annotators and the system
(through the system’s accuracy) and, on the other hand, the agreement between
the annotators.

4.1 AR System’s Accuracy 

One of the main obstacles found in the evaluation process has been the selected
fragment itself. Unfortunately, under the highest-number-of-anaphoras
criterion, the selected fragment turned out to be one of the most complex in the
corpus. This fragment belongs to a novelistic text and has a set of features that
make it really difficult to process with the AR system.

The passage has 36 possible anaphoric expressions that must be annotated: 
23 elliptical subjects, 12 atonic pronouns and 1 tonic pronoun. 

However, only half (18 units) are anaphoric expressions. The remaining
cases correspond to:

— Cataphoras: although the system is not designed to solve cataphora, as
mentioned in the algorithm in section 2.1, it returns no antecedent, suggesting
the possibility of a cataphoric expression. This can be considered a system
success.

— Pronominal verbs: in Spanish, some verbs can be accompanied by pronoun
particles that modify them but have no referential function. The system cannot
distinguish them unless they have different tags (which is not the case in this
corpus).

— Elliptical subjects with a verb in first or second person: these elliptical
subjects are not anaphoric but deictic expressions. The system does not deal with
first or second person pronouns.

— No explicit antecedent: finally, some anaphoric expressions do not have an
explicit antecedent in the previous text. Of course, it is not possible to find the
antecedent if it does not exist in the text.

Of these 18 anaphoric expressions, the system has correctly located and
solved half. Although this can be considered too low compared with the results
reported for classical AR approaches, it is essential (and very interesting)
to bear in mind the kind of problems detected in the resolution process:

1. Positional criterion: the selected fragment sometimes changes from direct
to indirect style because it contains dialogues. It is well known that
approaches to resolving anaphora in dialogues have to use different strategies
(e.g. based on discourse theories) from those for monologue texts (which is the
case of the ERA method). For example, the simple selection of the closest
candidate fails because of the changes between direct and indirect style.

2. Sentence window: the features of the text itself mean that the standard
search space for the solution (set at three sentences) has to be
extended to seven in order to include the correct antecedent in the candidate
list. Unfortunately, this search window is too big and introduces too much
noise (the number of candidates more than triples) into the resolution
process.

To summarize, although the system yields low accuracy, this is more
than justified bearing in mind the special characteristics of the text selected.
Although another “easier” fragment could have been selected for the evaluation,
we believe that these kinds of texts can help to improve AR methods.

We are confident (based on the evaluation of the ERA method on other corpora)
that the advances in the 3LB corpus tagging and, therefore, the availability of a
bigger set of texts, will provide the opportunity to demonstrate much better
system performance on a corpus with more varied domains.

4.2 Human Annotation Agreement 

One of the most important problems in manual annotation is the agreement
between two or more human annotators, and this situation is especially critical in
coreference annotation [1]. In order to calculate the annotation agreement, the
work of two annotators has been compared and the kappa measure [42] [43] has
been calculated®.

The annotation process has been guided by the ERA method through the 
3LB-RAT tool. Five possible cases of agreement have been defined: 

a) the annotator selects the antecedent suggested by the system 

b) the annotator selects the second antecedent suggested by the system 

c) the annotator selects the third antecedent suggested by the system 

d) the annotator selects another antecedent

e) other cases (no anaphora, no explicit antecedent, etc.). 

According to these cases, the results show an agreement of k = 0.84. According
to [44], a kappa measure higher than k = 0.8 indicates good agreement, so we
can consider the AR system a good consensus tool to guide the annotation
process.
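As an illustration of how such an agreement figure is obtained from the five cases (a) to (e), the following minimal sketch computes kappa for two annotators. The estimate of the chance agreement Pe from each annotator's own category distribution (Cohen's variant) is an assumption, since the text does not state which estimate was used, and the label sequences in the usage note are invented.

```python
from collections import Counter
from typing import Sequence


def kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Kappa = (Pa - Pe) / (1 - Pe) for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same, non-empty set of items")
    n = len(labels_a)
    # Observed agreement: proportion of items with identical labels.
    p_a = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' category proportions, summed.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in set(freq_a) | set(freq_b))
    return (p_a - p_e) / (1 - p_e)
```

For instance, with the invented sequences ["a", "a", "b", "b"] and ["a", "a", "b", "a"], the observed agreement is 0.75, the chance agreement is 0.5, and kappa is 0.5.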

5 Conclusions 

In this paper we have presented a coreference annotation scheme supported 
by a semantic-aided anaphora resolution method. We have tried to evaluate 



® The kappa measure is calculated according to the well-known formula
$k = \frac{P_a - P_e}{1 - P_e}$, where $P_a$ represents the annotation agreement
percentage and $P_e$ the agreement expected by chance.
how, and to what degree, the use of an AR system helps annotators in their job.
Furthermore, the system has been presented as an interesting agreement tool.
Problems detected in the resolution process can be used to improve the system
and to consider new viewpoints in the definition of tagging strategies.
Abstract. In order to achieve high precision Question Answering Systems
or Information Retrieval Systems, the incorporation of Natural Language
Processing techniques is needed. For this reason, this paper presents a
method that can be integrated in these kinds of systems.
The aim of this method, based on maximum entropy conditional probability
models, is semantic role labelling. The method, named SemRol,
consists of three modules. First, the sense of the verb is disambiguated.
Then, the argument boundaries of the verb are determined. Finally, the
semantic roles that fill these arguments are obtained.



1 Introduction 

One of the challenges of applications such as Information Retrieval (IR) or
Question Answering (QA) is to develop high quality systems (high precision IR/QA).
In order to do this, it is necessary to involve Natural Language Processing (NLP)
techniques in this kind of system. Among the different NLP techniques which
would improve Information Retrieval or Question Answering systems are
Word Sense Disambiguation (WSD) and Semantic Role Labelling (SRL). In this
paper a method of Semantic Role Labelling using Word Sense Disambiguation
is presented. This research is part of the project R2D2^.

A semantic role is the relationship that a syntactic constituent has with a
predicate. For instance, in the next sentence

(E0) The executives gave the chefs a standing ovation

“The executives” has the Agent role, “the chefs” the Recipient role and
“a standing ovation” the Theme role.

The problem of Semantic Role Labeling is not trivial. In order to identify
the semantic role of the arguments of a verb, two phases have to be solved first.
Firstly, the sense of the verb is disambiguated. Secondly, the argument
boundaries of the disambiguated verb are identified.



^ This paper has been supported by the Spanish Government under project ”R2D2: 
Recuperacion de Respuestas de Documentos Digitalizados” (TIC2003-07158-C04- 
01 ). 

J. L. Vicedo et al. (Eds.): EsTAL 2004, LNAI 3230, pp. 328-339, 2004. 
(E1) John gives out lots of candy on Halloween to the kids on his block
(E2) The radiator gives off a lot of heat

Depending on the sense of the verb, a different set of roles must be considered.
For instance, Figure 1 shows three senses of the verb give and the set of roles of
each sense. So, sentence (E0) matches sense 01. Therefore, the roles “giver”,
“thing given” and “entity given to” are considered. Nevertheless, sentence (E1)
matches sense 06 and sentence (E2) matches sense 04. Then, the sets of roles
are (“distributor”, “thing distributed”, “distributed”) and (“emitter”,
“thing emitted”), respectively. In sentence (E1), “John” has the distributor role,
“lots of candy” the thing distributed role, “the kids on his block” the
distributed role and “on Halloween” the temporal role. In sentence (E2),
“The radiator” has the emitter role and “a lot of heat” the thing emitted role.



<roleset id="give.01" name="transfer">
<roles>
<role n="0" descr="giver" vntheta="Agent"/>
<role n="1" descr="thing given" vntheta="Theme"/>
<role n="2" descr="entity given to" vntheta="Recipient"/>
</roles>

<roleset id="give.04" name="emit">
<roles>
<role n="0" descr="emitter"/>
<role n="1" descr="thing emitted"/>
</roles>

<roleset id="give.06" name="transfer">
<roles>
<role n="0" descr="distributor"/>
<role n="1" descr="thing distributed"/>
<role n="2" descr="distributed"/>
</roles>



Fig. 1. Some senses and roles of the frame give in PropBank [18] 



To achieve high precision IR/QA systems, recognizing and labelling semantic
arguments is a key task for answering “Who”, “When”, “What”, “Where”,
“Why”, etc. questions. For instance, the following questions could be answered
with sentence (E0). The Agent role answers question (E3) and the Theme role
answers question (E4).

(E3) Who gave the chefs a standing ovation? 

(E4) What did the executives give the chefs? 

Several works have already used WSD or Semantic Role Labeling in IR or
QA systems, without success. This is mainly due to two reasons:

1. The low precision achieved in these tasks.

2. The low portability of these methods.

It is easy to find WSD and Semantic Role Labeling methods that work with
high precision for a specific task or specific domain. Nevertheless, this precision
drops when the domain or the task is changed. For these reasons, this paper
addresses the problem of Semantic Role Labeling integrated with a WSD system. A
method based on a corpus approach is presented and several experiments on
both the WSD and Semantic Role Labeling modules are shown. In the near future,
a QA system with this Semantic Role Labeling module using WSD will be developed
in the R2D2 framework.
The remainder of this paper is organized as follows: section 2 presents the
SemRol method, starting in subsection 2.1 with an overview of the state of the
art in automatic Semantic Role Labeling systems. Subsection 2.2 then summarizes
Maximum Entropy Models as the approach used. Afterwards, the maximum
entropy-based method is presented in subsection 2.3. Then, some comments about
the experimental data, and an evaluation of our results using the method, are
presented in sections 3 and 4, respectively. Finally, section 5 concludes.

2 The SemRol Method 

The method presented in this section, named SemRol, consists of three main
modules: i) Word Sense Disambiguation Module, ii) Module of Heuristics, and
iii) Semantic Role Disambiguation Module. Both the Word Sense Disambiguation
Module and the Semantic Role Disambiguation Module are based on Maximum
Entropy Models. The Module of Heuristics and the Semantic Role Disambiguation
Module take care of the recognition and labeling of arguments, respectively. The
WSD module adds a new phase to the task: it disambiguates the sense of the
target verbs. The task thus becomes more straightforward because semantic roles
are assigned at the sense level.

In order to build this three-phase learning system, training and development
data sets are used. The PropBank corpus [18] is used, which is the Penn Treebank
corpus [17] enriched with predicate-argument structures. It addresses predicates
expressed by verbs and labels core arguments with consecutive numbers (A0 to
A5), trying to maintain coherence across different predicates. A number of
adjuncts, derived from the Treebank functional tags, are also included in PropBank
annotations.

2.1 Background 

Several approaches [3] have been proposed to identify semantic roles or to build
semantic classifiers. The task has usually been approached as a two-phase
procedure consisting of the recognition and labeling of arguments.

Regarding the learning component of the systems, we find pure probabilistic 
models ([8]; [9]; [7]), Maximum Entropy ([6]; [2]; [15]), generative models [26], 
Decision Trees ([25]; [5]), Brill’s Transformation-based Error-driven Learning 
([13]; [28]), Memory-based Learning ([27]; [14]), and vector-based linear classi- 
fiers of different types: Support Vector Machines (SVM) ([12]; [20]; [21]), SVM 
with polynomial kernels ([11]; [19]), and Voted Perceptrons also with polynomial 
kernels [4], and finally, SNoW, a Winnow-based network of linear separators
[22].

There have also been some attempts at relaxing the necessity of using syn- 
tactic information derived from full parse trees. For instance, in ([20]; [12]), only 
shallow syntactic information at the level of phrase chunk is used; or in the 
systems presented in the CoNLL-2004 shared task only partial syntactic infor- 
mation, i.e., words, part-of-speech (PoS) tags, base chunks, clauses and named 
entities, is used. 
Regarding the labeling strategy, at least three different strategies can be
distinguished. The first one consists of performing role identification directly by
IOB-type^ sequence tagging. The second approach consists of dividing the
problem into two independent phases: recognition, in which the arguments are
recognized, and labeling, in which the already recognized arguments are assigned
role labels. The third approach also proceeds in two phases: filtering, in which a
set of argument candidates is decided, and labeling, in which a set of optimal
arguments is derived from the proposed candidates. As a variant of the first
two-phase strategy, in [27] a direct classification of chunks into
argument labels is performed first, and then the actual arguments are decided in
a post-process by joining previously classified argument fragments.

2.2 Maximum Entropy Models 

Maximum Entropy (ME) modelling provides a framework to integrate informa- 
tion for classification from many heterogeneous information sources [16]. ME 
probability models have been successfully applied to some NLP tasks, such as 
PoS tagging or sentence boundary detection [23]. 

The method presented in this paper is based on conditional ME probabil- 
ity models. It has been implemented using a supervised learning method that 
consists of building classifiers using a tagged corpus. A classifier obtained by 
means of an ME technique consists of a set of parameters or coefficients which 
are estimated using an optimization procedure. Each coefficient is associated 
with one feature observed in the training data. The main purpose is to obtain 
the probability distribution that maximizes the entropy, that is, maximum igno- 
rance is assumed and nothing apart from the training data is considered. Some 
advantages of using the ME framework are that even knowledge-poor features 
may be applied accurately; the ME framework thus allows a virtually unre- 
stricted ability to represent problem-specific knowledge in the form of features 
[23]. 

Let us assume a set of contexts $X$ and a set of classes $C$. The function
$d : X \rightarrow C$ chooses the class $c$ with the highest conditional probability in the
context $x$: $d(x) = \arg\max_{c} p(c|x)$. Each feature is calculated by a function that
is associated with a specific class $c'$, and it takes the form of equation (1), where
$cp(x)$ is some observable characteristic in the context^. The conditional
probability $p(c|x)$ is defined by equation (2), where $\alpha_i$ is the parameter or weight
of the feature $i$, $K$ is the number of features defined, and $Z(x)$ is a constant
to ensure that the sum of all conditional probabilities for this context is equal
to 1.



^ IOB format represents chunks which neither overlap nor embed. Words outside a
chunk receive the tag O. For words forming a chunk of type k, the first word receives
the B-k tag (Begin), and the remaining words receive the tag I-k (Inside).

® The ME approach is not limited to binary functions, but the optimization proce- 
dure used for the estimation of the parameters, the Generalized Iterative Scaling 
procedure, uses this feature. 
$$f(x, c) = \begin{cases} 1 & \text{if } c' = c \text{ and } cp(x) = \text{true} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

$$p(c|x) = \frac{1}{Z(x)} \prod_{i=1}^{K} \alpha_i^{f_i(x, c)} \qquad (2)$$
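Equations (1) and (2) can be made concrete with a small sketch. It only evaluates the model for given weights; the estimation of the weights (for instance by Generalized Iterative Scaling) is not shown. The function names and the representation of a context as a list of words are assumptions made for illustration, and any weights or contexts passed in are the caller's own.

```python
from math import prod
from typing import Callable, List, Sequence, Tuple

# A feature is a pair (c', cp): the class it is tied to and a test on the context.
Feature = Tuple[str, Callable[[Sequence[str]], bool]]


def f(feature: Feature, x: Sequence[str], c: str) -> int:
    """Equation (1): 1 if the feature's class equals c and cp(x) holds, else 0."""
    c_prime, cp = feature
    return 1 if c_prime == c and cp(x) else 0


def p(c: str, x: Sequence[str], classes: List[str],
      features: List[Feature], alphas: List[float]) -> float:
    """Equation (2): product of the active weights, normalised by Z(x)."""
    def unnormalised(cls: str) -> float:
        return prod(alpha ** f(feature, x, cls) for feature, alpha in zip(features, alphas))
    z = sum(unnormalised(cls) for cls in classes)  # Z(x)
    return unnormalised(c) / z


def d(x: Sequence[str], classes: List[str],
      features: List[Feature], alphas: List[float]) -> str:
    """Decision function: the class with the highest conditional probability."""
    return max(classes, key=lambda c: p(c, x, classes, features, alphas))
```

Since inactive features contribute a factor of 1, a class with no active feature still receives a non-zero share of the probability mass, which is why ties between senses can occur when the context carries little information.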



2.3 The Core of SemRol Method 

The method consists of three main modules: i) Word Sense Disambiguation 
(WSD) Module, ii) Module of Heuristics, and iii) Semantic Role Disambiguation 
(SRD) Module. 

Unlike the systems presented in section 2.1, first of all, the process to obtain 
the semantic role needs the sense of the target verb. After that, several heuristics 
are applied in order to obtain the arguments of the sentence. And finally, the 
semantic roles that fill these arguments are obtained. 

Word Sense Disambiguation Module. This module is based on the WSD 
system developed by [24]. It is based on conditional ME probability models. 

The learning module produces classifiers for each target verb. This module
has two subsystems. The first subsystem consists of two component actions:
in a first step, the module processes the learning corpus in order to define the
functions that will capture the linguistic features of each context; in a second
step, the module fills in the feature vectors. The second subsystem of the learning
module performs the estimation of the coefficients and stores the classification
functions.

The classification module carries out the disambiguation of new contexts 
using the previously stored classification functions. When ME does not have 
enough information about a specific context, several senses may achieve the 
same maximum probability and thus the classification cannot be done properly. 
In these cases, the most frequent sense in the corpus is assigned. However, this 
heuristic is only necessary for a minimum number of contexts or when the set of 
linguistic attributes processed is very small. 
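The fallback to the most frequent sense can be sketched as follows. This is only an illustration of the rule described above: the function name, the dictionary representation of the probabilities and the numerical tolerance used to detect ties are assumptions.

```python
from typing import Dict


def choose_sense(probabilities: Dict[str, float], most_frequent_sense: str,
                 tolerance: float = 1e-12) -> str:
    """Return the single most probable sense, or the most frequent sense on a tie."""
    best = max(probabilities.values())
    winners = [sense for sense, value in probabilities.items()
               if abs(value - best) <= tolerance]
    return winners[0] if len(winners) == 1 else most_frequent_sense
```

When exactly one sense reaches the maximum probability the classifier's answer is kept; only when several senses share it is the corpus-level default used.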

The set of features defined for the training of the system is described below
(Figure 2) and depends on the data in the training corpus.
These features are based on words, PoS tags, chunks and clauses in the local
context.



SW features: content words in a sentence 
CW features: content words in a clause 
HP features: heads in syntactic phrases 
HLRP features: heads in -1, +1 syntactic phrases 



Fig. 2. List of types of features in WSD module 



Content-words refer to words with PoS related to noun, adjective, adverb or 
verb. For instance, if the sentence 
(E5) Confidence in the pound is widely expected to take another sharp dive
if trade figures for September, due for release tomorrow, fail to show a 
substantial improvement from July and August’s near-record deficits 

is considered, the SW features is the set of words: , , , 



Heads in syntactic phrases refer to words with PoS related to noun, in a noun 
phrase; or related to verb, in a verb phrase. 

Module of Heuristics. After determining the sense for every target verb of 
the corpus, it is necessary to determine the argument boundaries of those verbs. 
In a first approach, two arguments, the left argument and the right argument,
have been considered for each target verb. The left/right argument is made up
of the words of the sentence to the left/right of the verbal phrase where the
target verb is included. Besides, these words must belong to the same clause 
as the target verb. If the sentence (E5) is considered, where the target verb is 
, its left and right arguments are 

and , ’ 

, respectively. 

In a second approach, left and right arguments have also been considered.
However, in this case, the left argument is only the noun phrase to the left of
the target verb, and the right argument is only the noun phrase to the right of
the target verb. Besides, if there is a prepositional phrase next to the right
noun phrase, it is considered a second right argument. In any case, the
phrases must belong to the same clause as the target verb. So, in the sentence

(E6) The current account deficit will narrow to only 1.8 billion in September 

the target verb is , the left argument is and 

right arguments are , and 

Both approaches consider the verbal phrase of the target verb as the verb
argument, and modal verbs and the particles “not” and “n’t” in the verbal phrase
of the target verb as arguments. For instance, in the previous sentence, “will” is
considered an argument.

It is expected that the number of successes in left arguments, modal arguments
and negative arguments will be high and that they will not account for much
error. However, the results for right arguments will probably be lower. In future
work we will look at determining the arguments of the verbs using a
machine learning strategy, such as a maximum entropy conditional probability
method or a support vector machine method [10]. This strategy will allow us
to determine the argument boundaries more accurately.
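The second heuristic described above can be sketched over a chunked clause. Several points are assumptions made for illustration: the clause is given as a list of (chunk type, text) pairs already restricted to the clause of the target verb, "to the left/right" is read as the immediately adjacent chunk, a prepositional phrase is represented as a single chunk including its noun phrase, and the function and key names are hypothetical.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, str]  # (chunk type such as "NP", "VP", "PP", chunk text)


def second_approach(clause: List[Chunk], verb_index: int) -> Dict[str, str]:
    """Left NP, right NP and an optional PP after the right NP, all within one clause."""
    arguments: Dict[str, str] = {"verb": clause[verb_index][1]}
    # Left argument: the noun phrase immediately before the verbal phrase.
    if verb_index > 0 and clause[verb_index - 1][0] == "NP":
        arguments["left"] = clause[verb_index - 1][1]
    # Right argument: the noun phrase immediately after the verbal phrase.
    right = verb_index + 1
    if right < len(clause) and clause[right][0] == "NP":
        arguments["right"] = clause[right][1]
        # Second right argument: a prepositional phrase next to the right noun phrase.
        if right + 1 < len(clause) and clause[right + 1][0] == "PP":
            arguments["right_2"] = clause[right + 1][1]
    return arguments
```

For an invented clause such as [("NP", "The company"), ("VP", "sold"), ("NP", "its shares"), ("PP", "in September")] with the verb at index 1, the sketch yields a left argument, a right argument and a second right argument.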

Semantic Role Disambiguation Module. Finally, the roles for each target
verb, depending on its sense, are determined. This task uses a conditional ME
probability model, like the method used in the WSD task. In this case,
features are extracted for each argument of every target verb. These features
are used to classify those arguments. Instead of working with all roles [6], in this
classification the classes considered are the roles of each sense of each verb. This
increases the total number of classes for the full SRD task, but it considerably
reduces the number of classes that are taken into account for each argument.
In sentence (E5), the sense of fail is 01, so only the classes of the roles
0, 1, 2, 3 of fail.01 have been considered, whereas the roles 0, 1 of fail.02 have
not been considered. It is possible to do this because the sense of every target
verb was determined in the WSD module. Figure 3 shows the roles of the verb fail.



<roleset id="fail.01" name="not succeed">
<roles>
<role n="0" descr="assessor of not failing (professor)"/>
<role n="1" descr="thing failing"/>
<role n="2" descr="task"/>
<role n="3" descr="benefactive"/>
</roles>

<roleset id="fail.02" name="give failing grade">
<roles>
<role n="0" descr="teacher"/>
<role n="1" descr="student"/>
</roles>

Fig. 3. Senses and roles of the frame fail in PropBank 
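The effect of restricting the classes to the roleset of the disambiguated sense can be shown with the two rolesets of fail from Figure 3. The dictionary layout, the class naming scheme and the function name below are assumptions made for illustration.

```python
from typing import Dict, List

# Role numbers per roleset, taken from Figure 3.
ROLESETS: Dict[str, List[str]] = {
    "fail.01": ["0", "1", "2", "3"],
    "fail.02": ["0", "1"],
}


def candidate_classes(verb: str, sense: str, rolesets: Dict[str, List[str]]) -> List[str]:
    """Classes considered for an argument once the verb sense is known."""
    roleset_id = f"{verb}.{sense}"
    return [f"{roleset_id}:{number}" for number in rolesets[roleset_id]]


def total_classes(rolesets: Dict[str, List[str]]) -> int:
    """Total number of sense-specific classes over all rolesets."""
    return sum(len(numbers) for numbers in rolesets.values())
```

For fail there are six sense-specific classes in total, but an argument of a verb disambiguated as fail.01 is classified among only four of them, and one of fail.02 among only two.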



For each argument, the features are based on words
and part-of-speech tags in the local context. The words in the arguments whose
part-of-speech tag is one of NN, NNS, NNP, NNPS, JJ, JJR, JJS,
RB, RBR, RBS have been considered. That is, only nouns, adjectives and adverbs
have been considered.

In addition, verbs (VB, VBD, VBG, VBN, VBP, VBZ, MD) have been con- 
sidered whether they are target verbs or not. This set of features is named AW, 
content-words in the argument. In the (E5) instance, AW for left argument is the 
set of words ; and AW for right 

argument is the set of words . 

A straightforward classifier with just one set of features has been built. This 
is an attempt to evaluate the performance of the module with simple events and 
low computational cost. 
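The AW feature set described above amounts to a filter over the part-of-speech tags of the words in an argument. The following Python sketch shows one way to express it; the function name `aw_features`, the representation of an argument as (word, tag) pairs and the example argument are assumptions for illustration and are not taken from the paper.

```python
# Penn Treebank tags listed in the text: nouns, adjectives, adverbs and verbs.
CONTENT_TAGS = {"NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "JJS", "RB", "RBR", "RBS"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ", "MD"}


def aw_features(argument):
    """Return the set of content words (AW) of one argument.

    `argument` is a list of (word, pos_tag) pairs; this input format is an
    assumption of the sketch.
    """
    keep = CONTENT_TAGS | VERB_TAGS
    return {word for word, tag in argument if tag in keep}


# Hypothetical tagged argument, not sentence (E5) from the paper.
argument = [("the", "DT"), ("quarterly", "JJ"), ("report", "NN"),
            ("of", "IN"), ("the", "DT"), ("company", "NN")]
print(sorted(aw_features(argument)))  # ['company', 'quarterly', 'report']
```

Determiners and prepositions are dropped, so the classifier only sees the words that are likely to carry semantic information about the role.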

3 Experimental Data 

Our method has been trained and evaluated using the PropBank corpus [18],
which is the Penn Treebank corpus [17] enriched with predicate-argument
structures. To be precise, the data consist of sections of the Wall Street
Journal: the training set corresponds to sections 15-18 and the development set
to section 20.

PropBank annotates the Penn Treebank with argument structures related to
verbs. The semantic roles considered in PropBank are the following [3]:

— Numbered arguments (A0-A5, AA): Arguments defining verb-specific roles.
Their semantics depend on the verb and on the verb usage in a sentence, or
verb sense. In general, A0 stands for the agent and A1 corresponds to the
patient or theme of the proposition, and these two are the most frequent
roles. However, no consistent generalization can be made across different
verbs or different senses of the same verb. PropBank takes the definition of
verb senses from VerbNet, and for each verb and each sense defines the set
of possible roles for that verb usage, called roleset.

— Adjuncts (AM-): General arguments that any verb may take optionally.
There are 13 types of adjuncts, among them:

• AM-LOC: location
• AM-EXT: extent
• AM-DIS: discourse marker
• AM-ADV: general-purpose
• AM-NEG: negation marker
• AM-MOD: modal verb
• AM-CAU: cause
• AM-TMP: temporal
• AM-PRP: purpose
• AM-MNR: manner
• AM-DIR: direction

— References (R-): Arguments representing arguments realized in other parts
of the sentence. The role of a reference is the same as the role of the
referenced argument. The label is an R- tag prefixed to the label of the
referent, e.g. R-A1.

— Verbs (V): Participant realizing the verb of the proposition.

Training data consists of 8936 sentences, with 50182 arguments and 1838 
distinct target verbs. Development data consists of 2012 sentences, with 11121 
arguments and 978 distinct target verbs. 

Apart from the correct output, both datasets contain the input part of the
data: PoS tags, chunks and clauses. In addition, the verb sense is available
whenever the word is a target verb.



4 Results and Discussion 

The results of the three modules are shown in Table 1. These results have been
obtained on the development set. Modules have been evaluated based on
precision, recall and the F1 measure. In each case, precision is the proportion
of senses, arguments or roles predicted by the system which are correct, and
recall is the proportion of correct senses, arguments or roles which are
predicted by each module. F1 is the harmonic mean of precision and recall:
$F_{\beta=1} = 2pr/(p+r)$.

Table 1. Results on the development set

Module   Successes   Fails   Not disambiguated /   Precision   Recall   F1
                             not detected
WSD      3650        486     169                   0.88        0.85     0.86
MH       4426        4201    2494                  0.51        0.40     0.45
SRD      5085        5464    572                   0.48        0.46     0.47
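The figures in Table 1 can be re-derived from the raw counts with a few lines of Python. The sketch below assumes, as an interpretation that the paper does not state explicitly, that precision is successes / (successes + fails) and that recall also counts the items that were not disambiguated or not detected in the denominator. The function name `prf` is hypothetical.

```python
def prf(successes, fails, missed):
    """Precision, recall and F1 from the three counts reported per module."""
    precision = successes / (successes + fails)          # over predictions made
    recall = successes / (successes + fails + missed)    # over all gold items
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1


# (successes, fails, not disambiguated / not detected) as reported in Table 1.
TABLE_1 = {
    "WSD": (3650, 486, 169),
    "MH": (4426, 4201, 2494),
    "SRD": (5085, 5464, 572),
}

for module, counts in TABLE_1.items():
    p, r, f1 = prf(*counts)
    print(f"{module}: P={p:.2f} R={r:.2f} F1={f1:.2f}")
# WSD: P=0.88 R=0.85 F1=0.86
# MH: P=0.51 R=0.40 F1=0.45
# SRD: P=0.48 R=0.46 F1=0.47
```

Under this reading the three counts of the MH and SRD rows each add up to 11121, the number of arguments in the development set, which supports the interpretation.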



4.1 WSD Module Results 

In this experiment only one set of features has been considered, features
about . Table 1 shows that 3650 verbs have been disambiguated successfully and
655 unsuccessfully. Of the latter, 169 verbs were left without a sense and 486
were assigned a wrong sense. As a result, a precision of 88% is obtained. This
precision has been calculated over all verbs, both polysemous and monosemous.
These results suggest that the ME module is correctly defined. Moreover, tuning
with the other sets of features (see section 2.3) is expected to improve the
results.

4.2 MH Module Results 

This module has been tested with the two approaches described in section 2.3.
Table 1 shows the results for the second one. In this case, a total of 4426
arguments have been detected successfully, but 6695 have been detected
erroneously or missed. The precision of the MH module is therefore 51%. With
the other approach, precision drops to 34%.

In both cases, the experiments assume correct senses for the target verbs, so
that the MH module is evaluated independently of the WSD module.

These results confirm the need to determine the arguments of the verbs by
defining new heuristics or by using a machine learning strategy.

4.3 SRD Module Results 

In order to evaluate this module, correct verb senses and correct argument
boundaries have been assumed, so that the SRD module is tested independently
of the WSD and MH modules.

Table 1 shows a precision of 48%. The precision for each kind of argument is
shown in Table 2. If the verb argument (V) is also counted, precision goes up
to 62%. These results suggest that the ME module is correctly defined, although
a tuning phase is needed in order to improve them. In addition, the precision
of 0.00% obtained for several arguments shows the need for a co-reference
resolution module.

Table 2. Results on the development set. SRD module

Role        Precision   Recall    Fβ=1
A0          47.22%      44.56%    45.85
A1          44.99%      69.12%    54.50
A2          52.26%      26.62%    35.28
A3          30.16%      12.75%    17.92
A4          64.52%      13.61%    22.47
A5          100.00%     25.00%    40.00
AM-ADV      21.93%      7.10%     10.73
AM-CAU      0.00%       0.00%     0.00
AM-DIR      55.56%      8.33%     14.49
AM-DIS      86.44%      25.00%    38.78
AM-EXT      70.00%      14.29%    23.73
AM-LOC      44.44%      22.61%    29.97
AM-MNR      66.29%      17.66%    27.90
AM-MOD      82.35%      39.59%    53.47
AM-NEG      80.00%      64.12%    71.19
AM-PNC      39.47%      15.00%    21.74
AM-PRD      0.00%       0.00%     0.00
AM-PRP      0.00%       0.00%     0.00
AM-REC      0.00%       0.00%     0.00
AM-TMP      61.98%      21.48%    31.90
R-A0        0.00%       0.00%     0.00
R-A1        0.00%       0.00%     0.00
R-A2        0.00%       0.00%     0.00
R-AM-LOC    100.00%     25.00%    40.00
R-AM-TMP    33.33%      33.33%    33.33
V           97.44%      97.44%    97.44
all         61.92%      59.62%    60.75
all-{V}     47.42%      44.98%    46.17



5 Conclusions and Work in Progress

In this paper, a Semantic Role Labeling method using a WSD module is
presented. It is based on , . . The method presented consists of three
sub-tasks. First, the sense of the target verb is determined, since it is needed
to obtain the semantic roles. After that, several heuristics are applied in
order to obtain the arguments of the sentence. Finally, the semantic roles that
fill these arguments are obtained. Training and development data are used to
build this learning system.

Results for the WSD, MH and SRD modules have been shown. Currently, we are
working on the definition of new features for the SRD module. The re-definition
of the heuristics is also planned in order to improve the results. After that,
we are going to work on the tuning phase in order to achieve an optimum
identification of the semantic roles.

We are also working on the integration of semantic aspects in Question
Answering or Information Retrieval systems, in order to obtain high-precision
QA/IR systems. We will report results on this integration shortly.



References 

1. Eighth Conference on Natural Language Learning (CoNLL-2004), Boston, MA,
USA, May 2004.

2. U. Baldewein, K. Erk, S. Pado, and D. Prescher. Semantic role labeling with chunk 
sequences. In Proceedings of the Eighth Conference on Natural Language Learning 
(CoNLL-2004) [1]. 







3. X. Carreras and L. Marquez. Introduction to the CoNLL-2004 Shared Task:
Semantic Role Labeling. In Proceedings of the Eighth Conference on Natural
Language Learning (CoNLL-2004) [1].

4. X. Carreras, L. Marquez, and G. Chrupala. Hierarchical Recognition of Proposi- 
tional Arguments with Perceptrons. In Proceedings of the Eighth Conference on 
Natural Language Learning (CoNLL-2004) [1]. 

5. J. Chen and O. Rambow. Use of deep linguistic features for the recognition and 
labeling of semantic arguments. In Proceedings of the Conference on Empirical 
Methods in Natural Language Processing (EMNLP2003), July 2003. 

6. M. Fleischman, N. Kwon, and E. Hovy. Maximum Entropy Models for FrameNet 
Classification. In Proceedings of the Conference on Empirical Methods in Natural 
Language Processing (EMNLP2003), July 2003. 

7. D. Gildea and J. Hockenmaier. Identifying semantic roles using combinatory cate- 
gorial grammar. In Proceedings of the Conference on Empirical Methods in Natural 
Language Processing (EMNLP2003), July 2003. 

8. D. Gildea and D. Jurafsky. Automatic labeling of semantic roles. Computational 
Linguistics, 28(3):245-288, 2002. 

9. D. Gildea and M. Palmer. The necessity of parsing for predicate argument
recognition. In Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL), Philadelphia, July 2002.

10. J. Gimenez and L. Marquez. Fast and Accurate Part-of-Speech Tagging: The
SVM Approach Revisited. In Proceedings of Recent Advances in Natural Language
Processing 2003, Borovets, Bulgaria, September 2003.

11. K. Hacioglu, S. Pradhan, W. Ward, J.H. Martin, and D. Jurafsky. Semantic Role
Labeling by Tagging Syntactic Chunks. In Proceedings of the Eighth Conference
on Natural Language Learning (CoNLL-2004) [1].

12. K. Hacioglu and W. Ward. Target word detection and semantic role chunking
using support vector machines. In Proceedings of the Human Language Technology
Conference (HLT-NAACL), Edmonton, Canada, June 2003.

13. D. Higgins. A transformation-based approach to argument labeling. In
Proceedings of the Eighth Conference on Natural Language Learning
(CoNLL-2004) [1].

14. B. Kouchnir. A Memory-based Approach for Semantic Role Labeling. In
Proceedings of the Eighth Conference on Natural Language Learning
(CoNLL-2004) [1].

15. J. Lim, Y. Hwang, S. Park, and H. Rim. Semantic role labeling using maximum
entropy model. In Proceedings of the Eighth Conference on Natural Language
Learning (CoNLL-2004) [1].

16. C.D. Manning and H. Schütze. Foundations of Statistical Natural Language
Processing. The MIT Press, Cambridge, Massachusetts, 1999.

17. M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz. Building a large annotated
corpus of English: the Penn Treebank. Computational Linguistics, (19), 1993.

18. M. Palmer, D. Gildea, and P. Kingsbury. The proposition bank: An annotated
corpus of semantic roles. Computational Linguistics, 2004. Submitted.

19. K. Park, Y. Hwang, and H. Rim. Two-Phase Semantic Role Labeling based on
Support Vector Machines. In Proceedings of the Eighth Conference on Natural
Language Learning (CoNLL-2004) [1].

20. S. Pradhan, K. Hacioglu, V. Krugler, W. Ward, J.H. Martin, and D. Jurafsky.
Support vector learning for semantic argument classification. Technical report,
International Computer Science Institute, Center for Spoken Language Research,
University of Colorado, 2003.







21. S. Pradhan, K. Hacioglu, W. Ward, J.H. Martin, and D. Jurafsky. Semantic
role parsing: Adding semantic structure to unstructured text. In Proceedings of
the Third IEEE International Conference on Data Mining (ICDM), Melbourne,
Florida, USA, November 2003.

22. V. Punyakanok, D. Roth, W. Yih, D. Zimak, and Y. Tu. Semantic Role Labeling
Via Generalized Inference Over Classifiers. In Proceedings of the Eighth Conference
on Natural Language Learning (CoNLL-2004) [1].

23. A. Ratnaparkhi. Maximum Entropy Models for Natural Language Ambiguity
Resolution. PhD thesis, University of Pennsylvania, 1998.

24. A. Suarez and M. Palomar. A maximum entropy-based word sense disambiguation
system. In Proceedings of the 19th International Conference on Computational
Linguistics (COLING), pages 960-966, Taipei, Taiwan, August 2002.

25. M. Surdeanu, S. Harabagiu, J. Williams, and P. Aarseth. Using predicate-argument
structures for information extraction. In Proceedings of the 41st Annual Meeting of
the Association for Computational Linguistics (ACL), Sapporo, Japan, July 2003.

26. A. Thompson, R. Levy, and C.D. Manning. A generative model for semantic role
labeling. In Proceedings of the 14th European Conference on Machine Learning
(ECML), Cavtat-Dubrovnik, Croatia, September 2003.

27. A. van den Bosch, S. Canisius, I. Hendricks, W. Daelemans, and E.T.K. Sang.
Memory-based semantic role labeling: Optimizing features, algorithm and output.
In Proceedings of the Eighth Conference on Natural Language Learning
(CoNLL-2004) [1].

28. K. Williams, C. Dozier, and A. McCulloh. Learning Transformation Rules for 
Semantic Role Labeling. In Proceedings of the Eighth Conference on Natural Lan- 
guage Learning (CoNLL-2004) [1]. 




Significance of Syntactic Features for 
Word Sense Disambiguation 



Ala Sasi Kanth and Kavi Narayana Murthy 

Department of Computer and Information Sciences 
University of Hyderabad, India 
sasi_kanth_a@yahoo.co.in, knmuh@yahoo.com



Abstract. In this paper we explore the use of syntax in improving the
performance of Word Sense Disambiguation (WSD) systems. We argue
that not all words in a sentence are useful for disambiguating the senses
of a target word and that eliminating noise is important. Syntax can be used
to identify related words, and eliminating other words as noise actually
improves performance significantly. CMU’s Link Parser has been used for
syntactic analysis. Supervised learning techniques have been applied to
perform word sense disambiguation on selected target words. The Naive
Bayes classifier has been used in all the experiments. All the major
grammatical categories of words have been covered. Experiments conducted
and results obtained have been described. Ten-fold cross validation has
been performed in all cases. The results we have obtained are better than
the published results for the same data.



1 Introduction 

A word can have more than one sense. The sense in which the word is used can
be determined, most of the time, by the context in which the word occurs. The
word bank has several senses, out of which bank as a financial institution and
bank as a sloping land bordering a river can be easily distinguished from the
context. Distinguishing between the senses of bank as a financial institution
and bank as a building housing such an institution is more difficult. The
process of identifying the correct sense of words in context is called Word
Sense Disambiguation (WSD). Homonymy and polysemy must both be considered.
Word sense disambiguation contributes significantly to many natural language
processing tasks such as machine translation and information retrieval.

The focus of research in WSD is on distinguishing between senses of words
within a given syntactic category, since senses across syntactic categories are
better disambiguated through POS tagging techniques. Many researchers have
focused on disambiguation of selected target words although there is some recent
interest in unrestricted WSD [1,2].

mission under the UPE scheme.

WSD systems often rely upon sense definitions in dictionaries, features of
senses (for example, box-codes and subject categories present in Longman’s
Dictionary of Contemporary English (LDOCE)), entries in bilingual dictionaries,
WordNet etc. Dictionaries and other sources do not always agree on the number
and nature of senses for given words. For some tasks the fine granularity
of senses as given in some dictionaries is not required or may even be
counter-productive, and so methods to merge closely related senses have been
explored by some researchers [3].

Both knowledge based and machine learning approaches have been applied 
for WSD. Lesk [4] used glossaries of senses present in dictionaries. The sense 
definition which has the maximum overlap with the definitions of the words in 
the context was taken as the correct sense. Lesk’s algorithm uses the knowledge 
present in dictionaries and does not use any sense tagged corpus for training. On 
the other hand, machine learning methods require a training corpus. Yarowsky 
[5] devised an algorithm which takes some initial seed collocations for each sense 
and uses unsupervised techniques to produce decision lists for disambiguation. 
In supervised disambiguation, machine learning techniques are used to build a 
model from labeled training data. Some of the machine learning techniques used 
for WSD are - decision lists [6,7], Naive Bayes classifier [8] and decision trees. 
In this work we have used the Naive Bayes Classifier. 

It has been argued that the choice of the right features is more important than 
the choice of techniques for classification [9, 10]. A variety of features have been 
used, including bigrams, surface form of the target word, collocations, POS tags 
of target and neighboring words and syntactic features such as heads of phrases 
and categories of phrases in which the target word appears. Some researchers 
believe that lexical features are sufficient while others [11, 12] have argued for 
combining lexical features with syntactic features. In this paper we show that 
syntax can significantly improve the performance of WSD systems. We argue 
that elimination of noise is important - not all words in a given sentence are 
useful for disambiguating the sense of a target word. We have used CMU’s Link 
parser to identify words that are syntactically related to the target word. Words 
which are not syntactically related to the target word are considered to be noise 
and eliminated. The results we get are comparable to, or better than, the best 
results obtained so far on the same data. 



2 Role of Syntax in WSD 

Not all words in the context are helpful for determining the sense of a target
word. Syntax can help in identifying relevant parts of the context, thereby
eliminating noise. Using syntactic features for WSD is not entirely new. Ng [13]
used syntactic information such as verb-object and subject-verb relationships
along with the basic lexical features. Yarowsky [14] also used similar syntactic
information including verb-object, subject-verb and noun-modifier. Stetina [15]
and some of the work presented in the Senseval-2 workshop [16] have also
explored the possibility of combining lexical and syntactic features. Recently,
Mohammad and Pedersen [11] have analyzed the role of various kinds of lexical
and syntactic features in isolation as well as in various combinations. They
also employ an ensemble technique to combine the results of classifiers using
different sets of features. They propose a method to estimate the best possible
performance through such an ensemble technique. They use a simple ensemble
method and show that their results are comparable to the best published results
and close to the optimal. However, the exact contribution of syntax is not very
clear from these studies. Martinez et al. [12] have compared the performance
obtained by using only basic (lexical and topical) features with the performance
obtained by using basic features combined with syntactic features. They show a
performance gain of 1% to 2% for the AdaBoost algorithm, while there was no
improvement for the Decision List method. In this paper we explore the role of
syntactic features in WSD and show that syntax can in fact make a significant
contribution to WSD. We have obtained 4% to 12% improvement in performance
for various target words. The results we get are comparable to, or better than,
the best published results on the same data.

We have used the Link parser developed by Carnegie-Mellon University. The 
link parser gives labeled links which connect pairs of words. We have found this 
representation more convenient than parse trees or other representations given 
by other parsers. Our studies have shown that eliminating noise and using only 
selected context words is the key to good performance. Syntax has been used only 
for identifying related words. In the next section we describe the experiments we 
have conducted and the results obtained. 

3 Experimental Setup 

Here we have applied supervised learning techniques to perform word sense dis- 
ambiguation on selected target words. All the major grammatical categories of 
words have been covered. The Naive Bayes classifier has been used as the base. 
Ten fold cross validation has been performed in all cases. We give below the de- 
tails of the corpora and syntactic parser used and the details of the experiments 
conducted. 

3.1 Corpora 

For our experiments we have used publicly available corpora converted into
Senseval-2 data format by Ted Pedersen
(http://www.d.umn.edu/~tpederse/data.html). We have chosen interest, serve and
hard as the target words, covering the major syntactic categories: noun, verb
and adjective respectively.

In the interest corpus each instance of the word interest is tagged with one of
six possible LDOCE senses. There is a total of 2368 occurrences in the sense
tagged corpus, where each occurrence is a single sentence that contains the word
interest. The instances in the corpus are selected from the Penn Treebank Wall
Street Journal Corpus (ACL/DCI version). Sense tagging was done by Rebecca Bruce
and Janyce Wiebe [17]. The sense tags used, frequency, and glosses of the senses
are given in Table 1.



Table 1. Senses of the word interest, their distribution in the corpus and the gloss of
senses

Sense Label   Frequency    Sense Definition
interest_1    361 (15%)    readiness to give attention
interest_2    11 (01%)     quality of causing attention to be given to
interest_3    66 (03%)     activity, etc. that one gives attention to
interest_4    178 (08%)    advantage, advancement or favor
interest_5    500 (21%)    a share in a company or business
interest_6    1252 (53%)   money paid for the use of money



The hard corpus contains the word hard with part of speech as adjective
and is manually tagged with three senses in 4333 contexts. The hard data was
created by Leacock, Chodorow and Miller [18]. The instances were picked from
the San Jose Mercury News Corpus and manually annotated with one of three
senses from WordNet. The sense tags used, frequency in the corpus, glosses of
the senses and examples are given in Table 2.

Table 2. Senses of the word hard, their distribution in the corpus, glosses of senses
and examples

Sense Label   Frequency    Sense Definition        Example
HARD1         3455 (80%)   not easy - difficult    it’s hard to be disciplined
HARD2         502 (11%)    not soft - metaphoric   these are hard times
HARD3         376 (9%)     not soft - physical     the hard crust



The serve corpus contains the word serve with part of speech as verb and
is manually tagged in 4378 contexts. The serve data was created by Leacock,
Chodorow and Miller [18]. The instances were picked from the Wall Street
Journal Corpus (1987-89) and the American Printing House for the Blind (APHB)
corpus. The sentences have been manually tagged with the four senses from
WordNet. The sense tags used, frequency in the corpus, glosses of the senses and
examples are given in Table 3.

Table 3. Senses of the word serve, their distribution in the corpus, and the gloss of
senses

Sense Label   Frequency    Sense Definition         Example
SERVE2        853 (20%)    function as something    serves as yard stick to
SERVE6        439 (10%)    provide a service        department will serve select few
SERVE10       1814 (41%)   supply with food/means   serve dinner
SERVE12       1272 (29%)   hold an office           served as head of department

3.2 Parser

For obtaining the syntactic information we have used the link parser from
Carnegie-Mellon University (http://www.link.cs.cmu.edu/link/). The link parser
is a syntactic parser based on link grammar, a theory of English syntax [19, 20].
It is a robust, broad-coverage parser. If it is unable to parse the sentence
fully, it tries to give a partial structure to the sentence. Given a sentence,
the parser assigns to it a syntactic structure, which consists of a set of
labeled links connecting pairs of words. For example, the parsed structure of
the sentence

"The flight landed on rocky terrain"

is given by the link parser as

                            +--------Jp--------+
 +--D*u--+----Ss----+--MVp--+       +----A-----+
 |       |          |       |       |          |
the   flight.n   landed.v   on   rocky.a   terrain.n

A  - connects attributive adjectives to following nouns
S  - connects subject nouns to finite verbs
D  - connects determiners to nouns
J  - connects prepositions to their objects
MV - connects verbs and adjectives to modifying phrases that follow,
     like adverbs, prepositional phrases, subordinating conjunctions,
     comparatives and participle phrases with commas.

A word to be disambiguated may be connected directly to another word with
a link or it can be indirectly connected through a series of links. Consider
the target word interest in the context:

       +--------A---------+
       |        +---AN----+
       |        |         |
....heavy.a interest.n rates.n

AN - connects noun-modifiers to following nouns

Here the word interest is directly linked with the word rates, and the word
heavy is indirectly linked to interest. The words which are directly or
indirectly connected to the target word can be taken as the values of the
attribute labeled with the name or names of the links. An exclamation mark
indicates a leftward link to the target word. The attribute and the value are
represented as (attribute, value) pairs. See examples below.

    +---AN----+           +---AN---+
    |         |           |        |
interest.n rates.n      key.n interests.n
(!AN, rates)            (AN, key)

   +--------A---------+
   |        +---AN----+
   |        |         |
heavy.a interest.n rates.n
(!AN, rates) (A-!AN, heavy)

(The lower case letters in the link labels specify additional properties of the
links. These are not used in our experiments.)

Alternatively, we can simply use the syntactically related words as features, 
without regard to the specific syntactic relationships. The Link parser employs 
a large number of different types of links and including the link labels greatly 
increases the space of possible feature values, thereby introducing sparseness in 
available data. In our experiments described here, we have used only words as 
features, not the link labels. 



4 Experiments and Results 

We give below the details of the experiments we have conducted. The results are
summarized in Table 4 below.

4.1 Experiment A: Baseline 

The baseline performance can be defined as the performance obtained when the
most frequent sense in the corpus is assigned to the target word in all contexts.
This can be viewed as a bottom line for comparison of various techniques. The
baseline performance depends on the flatness of the distribution of the various
senses for a given target word. If the senses are all more or less equally likely,
the baseline would be low, and if one of the senses is much more frequent than
the others, the baseline would be high. It can be seen that the baseline for hard
is quite high.
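The baseline can be recomputed directly from the sense frequencies given in Tables 1 to 3. The Python sketch below does this; the function name `most_frequent_sense_baseline` and the dictionary layout are assumptions for illustration. For hard the computed value rounds to 79.74, one hundredth above the 79.73 reported in Table 4, which may be due to rounding or to averaging over folds.

```python
# Sense frequencies copied from Tables 1-3.
SENSE_COUNTS = {
    "interest": {"interest_1": 361, "interest_2": 11, "interest_3": 66,
                 "interest_4": 178, "interest_5": 500, "interest_6": 1252},
    "hard": {"HARD1": 3455, "HARD2": 502, "HARD3": 376},
    "serve": {"SERVE2": 853, "SERVE6": 439, "SERVE10": 1814, "SERVE12": 1272},
}


def most_frequent_sense_baseline(counts):
    """Return (sense, accuracy in %) when the majority sense is always chosen."""
    sense = max(counts, key=counts.get)
    accuracy = 100.0 * counts[sense] / sum(counts.values())
    return sense, accuracy


for word, counts in SENSE_COUNTS.items():
    sense, accuracy = most_frequent_sense_baseline(counts)
    print(f"{word}: {sense} {accuracy:.2f}")
# interest: interest_6 52.87
# hard: HARD1 79.74
# serve: SERVE10 41.43
```

The flatter distribution of serve (four senses between 10% and 41%) yields the lowest baseline, while the peaked distribution of hard yields the highest.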

4.2 Experiment B: NaiveBayes (All Words in the Context) 

Here all the words in the context of the target word are taken as features. By
context we mean the whole sentence in which the target word appears. The
words in the context are considered as a bag of words without regard to word
order or syntactic structure.

It may be noted that the performance is in general much better than the
baseline. (There is, however, a small decrease in the case of the word hard:
the distribution of senses of the word hard is quite peaked and the baseline
itself is quite high.)
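A bag-of-words Naive Bayes classifier of the kind used in this experiment can be sketched in a few lines of Python. This is only an illustration: the class name `NaiveBayesWSD`, the add-one smoothing (`alpha`), the decision to skip words unseen in training and the toy training sentences are all assumptions, since the paper does not describe these details.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesWSD:
    """Multinomial Naive Bayes over bags of context words (illustrative)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant (assumed)
        self.sense_counts = Counter()            # number of examples per sense
        self.word_counts = defaultdict(Counter)  # word frequencies per sense
        self.vocab = set()

    def train(self, examples):
        for words, sense in examples:
            self.sense_counts[sense] += 1
            self.word_counts[sense].update(words)
            self.vocab.update(words)

    def predict(self, words):
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense, n in self.sense_counts.items():
            score = math.log(n / total)          # log prior
            denom = (sum(self.word_counts[sense].values())
                     + self.alpha * len(self.vocab))
            for w in words:
                if w in self.vocab:              # unseen words are skipped
                    count = self.word_counts[sense][w]
                    score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense


# Toy training data (hypothetical contexts, sense labels as in Table 1).
examples = [
    (["bank", "raised", "rates"], "interest_6"),
    (["pay", "rates", "loan"], "interest_6"),
    (["showed", "great", "music"], "interest_1"),
    (["lost", "music", "art"], "interest_1"),
]
model = NaiveBayesWSD()
model.train(examples)
print(model.predict(["loan", "rates"]))  # interest_6
print(model.predict(["music"]))          # interest_1
```

Because every word of the sentence contributes a factor to the score, irrelevant words add noise to the decision, which is the effect the following experiments try to reduce.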

Not all words in the sentence are likely to be useful for disambiguating the 
target word. Some words may even have negative effect on the performance. 
Eliminating or reducing noise should help to achieve better results. The ques- 
tion that remains is the basis for including or excluding words in the context. 
Syntax captures the internal structure of sentences and explicates a variety of 
relationships amongst words. We argue, therefore, that syntax should be useful 
for deciding which words in the sentence are related to the target word and hence 
likely to influence its sense. The following subsections show various experiments 
we have carried out to verify this claim. We have found CMU’s Link Parser to 
be an appropriate choice since it directly indicates relationships between pairs of 
words. Our studies show that syntax indeed has a very significant contribution 
in improving the performance of WSD. 

4.3 Experiment C: NaiveBayes (Syntactically Relevant Words) 

In this experiment, all the words which are linked directly or indirectly (up to two 
levels) to the target word, that is, a bag of selected words from the sentence, are 
taken as features. It may be seen from the table of results that performance has 
significantly improved. This vindicates our claim that not all words in context 
are useful and elimination of noise is important. 
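Selecting the words linked to the target within a fixed number of links is a breadth-first search over the link graph. The Python sketch below works on a hand-written list of links for the example sentence of section 3.2 and does not call the Link Parser itself. The function name `related_words` and the reading of "two levels" as a path of at most two links are assumptions.

```python
from collections import deque


def related_words(words, links, target, max_links=2):
    """Words reachable from `target` through at most `max_links` links.

    `words` is the token list, `links` a list of (index_a, index_b) pairs and
    `target` the index of the target word (input format assumed here).
    """
    graph = {}
    for a, b in links:                      # links are treated as undirected
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    distance = {target: 0}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if distance[node] == max_links:     # do not expand beyond the limit
            continue
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)

    return [words[i] for i in sorted(distance) if i != target]


# "The flight landed on rocky terrain": D, S, MV, J and A links from section 3.2.
words = ["the", "flight", "landed", "on", "rocky", "terrain"]
links = [(0, 1), (1, 2), (2, 3), (3, 5), (4, 5)]

print(related_words(words, links, target=2))
# ['the', 'flight', 'on', 'terrain']
print(related_words(words, links, target=5, max_links=1))
# ['on', 'rocky']
```

For the target landed, the word rocky is three links away and is therefore treated as noise under this reading.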

4.4 Experiment D: NaiveBayes (Words in a Window) 

Our studies have shown that neighboring words often have a considerable influ- 
ence on the sense of a target word. For example, adjectives and the nouns they 
modify occur close together and tend to influence one another. The object of a 
verb may appear close to the verb and have a say in the sense of the verb. Our 
results for a window of ±2 words around the target word validate this point. 
The results also confirm our claim that not all words in the sentence are relevant 
and eliminating noise helps. 

We have conducted experiments with various sizes of windows around the 
target word. The best results are obtained for a window size of 2 to 3. As the 
window size gets larger, more and more noise words will start getting in and the 
performance drops. 
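Window-based feature selection is simple to state in code. In the Python sketch below, the function name `window_words` and the example sentence are assumptions; the window is clipped at the sentence boundaries, which the paper does not discuss explicitly.

```python
def window_words(tokens, target_index, size=2):
    """Words within `size` positions to the left and right of the target."""
    left = tokens[max(0, target_index - size):target_index]
    right = tokens[target_index + 1:target_index + 1 + size]
    return left + right


# Hypothetical sentence with the target word "interest" at position 4.
tokens = "the bank raised its interest rates again last week".split()

print(window_words(tokens, 4, size=2))  # ['raised', 'its', 'rates', 'again']
print(window_words(tokens, 0, size=2))  # ['bank', 'raised']
```

Increasing `size` brings in more distant words, which corresponds to the drop in performance reported for larger windows.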

4.5 Experiment E: NaiveBayes (Syntactically Related and 
Neighborhood Words) 

Here we combine syntactically linked words and words in the neighborhood
of the target word for feature selection. Note that neighboring words may not
always be linked syntactically. In this study, words syntactically related to the
target word either directly or through one level of intermediate link have been
included. Also, words in a neighborhood of ±2 are included. All other words in
the sentence are treated as noise and ignored. It may be seen that this
combination generally outperforms the other schemes. In fact the results
obtained here are better than the best published results [11]. It may be noted
that we have actually used only lexical features, not any syntactic features
directly. Syntax has been used only to identify related words and remove other
words as noise.

We have also conducted experiments where the features included not just 
the related words but also the specific syntactic relations as expressed by the 
links in the Link parser. This greatly enlarges the feature space as there are 
several hundred different types of links. The sparseness of training data under 
this expanded feature space will limit the performance obtainable. Our experi- 
ments have shown that not much improvement in performance is possible with 
the available data. 

4.6 Results 

Table 4 shows the results of our experiments: 



Table 4. Accuracy (in %) for the three words

                                        Experiment
Word         A: Baseline  B: All words  C: Only syntactically  D: Neighboring  E: Selected
                          in sentence   related words          words           words
interest(n)  52.87        85.66         86.14                  87.94           89.39
hard(adj)    79.73        78.22         90.59                  91.53           90.91
serve(v)     41.43        76.88         81.93                  81.23           85.25



5 Conclusions 

In this paper we have explored the contribution of syntax to word sense
disambiguation. Not all words in a sentence are helpful for identifying the intended
sense of a given target word. Syntactic information can be used to identify useful
parts of the context and thereby reduce noise. We have not directly used any
syntactic features. Syntax helps in the selection of the right lexical features, and
our experiments show that elimination of noise can significantly improve the
performance of WSD. Improvement in performance ranges from about 4% to about
12%. Overall performance achieved ranges from about 85% to about 90% and
is comparable to, or better than, the best results published on similar data.

Abstract. In the current European scenario, characterized by the coexistence of
communities writing and speaking a great variety of languages, machine translation
has become a technology of capital importance. In areas of Spain and of
other countries, the co-official status of several languages implies producing several
versions of public information. Machine translation between all the languages of the
Iberian Peninsula and from them into English will allow for a better integration
of Iberian linguistic communities with one another and within Europe. The purpose
of this paper is to present a machine translation system from Spanish to Catalan
that deals with text input. In our approach, both deductive (linguistic) and inductive
(corpus-based) methodologies are combined in a homogeneous and efficient
framework: finite-state transducers. Some preliminary results show the interest of
the proposed architecture.



1 Introduction 

Machine translation and natural computer interaction are questions that engineers and
scientists have been interested in for decades. In addition to their importance for the
study of human speech characteristics, these applications are of social and economic
interest because their development would reduce the linguistic barriers
that prevent us from confidently carrying out activities such as travelling to other
countries or accessing some computer science services (foreign websites and so on).



* Work partially supported by the Spanish CICYT under grants TIC 2000-1599-C02-01 and TIC
2003-08681-C02-02 and by the Research Supporting Program from the Univ. Pol. of Valencia.

J. L. Vicedo et al. (Eds.): EsTAL 2004, LNAI 3230, pp. 349-359, 2004.

© Springer-Verlag Berlin Heidelberg 2004







J.R. Navarro et al. 



Machine translation has received increasing attention in the last decades due to
its commercial interest and to the availability of large linguistic and computational
resources. These resources are allowing machine translation systems to move beyond
the academic sphere and become more useful tools for professionals and general users.

Nevertheless, the complexity of natural language makes it difficult to develop
high-quality systems. This opens multiple lines of investigation in which researchers work
hard to improve translation results. The three most important machine translation
problems are:

- PoS¹ tagging, whose objective is to identify the lexical category that a word has in
a sentence [1, 2, 3].

- Semantic disambiguation, which decides the right sense of a word in a text
[4, 5].

- Reordering, which is often needed when translating between languages of different
families.

The approaches that have traditionally been used to face these problems can be
classified into two big families: knowledge-based and corpus-based methods.
Knowledge-based techniques formalize expert linguistic knowledge, in the form of rules,
dictionaries, etc., in a computable way. Corpus-based methods use statistical pattern
recognition techniques to automatically infer models from text samples without
necessarily using a priori linguistic knowledge.

The SisHiTra (Hybrid Translation System) project tries to combine knowledge-based
and corpus-based techniques to produce a Spanish-to-Catalan machine translation
system with no semantic constraints. Spanish and Catalan are languages belonging to the
Romance language family and have many characteristics in common. SisHiTra makes
use of their similarities to simplify the translation process. A future goal for SisHiTra
is its extension to other language pairs (Portuguese, French, Italian, etc.).

Knowledge-based techniques are the classical approach to general-scope machine
translation systems. Nevertheless, inductive methods have shown competitive
results on semantically constrained tasks.

Moreover, finite-state transducers [6, 7, 8] have been successfully used to implement
both rule-based and corpus-based machine translation systems. Techniques based on
finite-state models have also allowed for the development of useful tools for natural
language processing [9, 10, 11, 3] that are interesting because of their simplicity and
their reasonable time complexity.

The SisHiTra project was proposed on the basis of the experience acquired in
InterNOSTRUM [12] and TAVAL [13]. The SisHiTra system is able to deal with both eastern
and western Catalan dialectal varieties, because the dictionary, which is its main
database, explicitly establishes this distinction.

The SisHiTra prototype was designed as a serial process in which every module
performs a specific task. In the next section we explain the different parts into which
the SisHiTra system is divided.



¹ Parts of Speech.

2 Implementation

The methodologies used to represent the different knowledge sources
(dictionary, module interfaces, etc.) are based on finite-state machines: Hidden Markov
Models (HMM) are applied in disambiguation modules [13], and stochastic transducers
are used as data structures for dictionary requests as well as for inter-module
communication. The reasons for using finite-state methodology are as follows:

- Finite-state machines are easily represented in a computer, which facilitates their
exploitation, visualization and transfer.

- There are algorithms that allow for their manipulation in an efficient way (Viterbi,
beam search, etc.).

- There are algorithms for their automatic inference (both their topology and their
associated probability distributions) from examples.

- Linguistic knowledge can be adequately incorporated.

- It allows for either serial or integrated use of the different knowledge sources.

- More powerful models, such as context-free grammars, can be used by means of a
finite-state approach.

2.1 System Architecture 

The system developed in the SisHiTra project translates from Spanish to Catalan. It is a
general-scope translator with wide vocabulary coverage, so it is able to deal with all
kinds of sentences.

As previously mentioned, the modules of the translation prototype are based on finite-state
machines. This provides a homogeneous and efficient framework. Engine modules
process input text in Spanish by means of a cascade of finite-state models that represent
both linguistic and statistical knowledge. As an example, PoS tagging of input sentences
is done by means of a serial process that requires the use of two finite-state machines:
the first represents a knowledge-based dictionary and the second defines a
corpus-based disambiguation model. Finite-state models are also used to represent partial
information during translation stages (e.g. lexically ambiguous sentences).

SisHiTra, like many other systems, needs to semantically disambiguate
words before turning them into target language items. Semantic disambiguation
methods try to find out the implicit meaning of a word in its surrounding context.

SisHiTra is designed to perform semantic disambiguation in two steps: first, a
rule-based module solves some ambiguities according to certain well-known linguistic
information and, afterwards, a second module finishes the job by means of corpus-based
inductive methods. Statistical models are receiving more interest every day for
several reasons. The most important one is that they are cheaper and faster to generate
than knowledge-based systems. Statistical techniques learn automatically from corpora,
without the process of producing linguistic knowledge. Of course, obtaining corpora
for model training is not an effortless task. Models for semantic disambiguation in
SisHiTra need parallel corpora, that is, corpora where text segments (such as sentences or
paragraphs) in one language are matched with their corresponding translations in the other
language. These corpora have been obtained from different bilingual electronic
publications (newspapers, official texts, etc.) and they have been aligned by means of
different alignment algorithms.

The SisHiTra system is structured into the following modules:

- Fragmenter module: It divides the original text into sentences.

- Labeler module: A dictionary request produces a syntactic graph that represents
all the possible analyses of the input sentence.

- Syntactic disambiguation module: By means of statistical models, it finds the most
likely syntactic analysis among all those that the labeler module produces.

- Nominal phrase agreement module: It makes the elements of every nominal phrase agree
with each other in gender and number.

- Localizer module: Another dictionary request produces a lemma graph that represents
all the possible translations for the previously analyzed and disambiguated
sentence.

- Semantic disambiguation module: Here, a prototype in which disambiguation is
carried out according only to the dictionary is presented, but we are testing some
beta systems that use statistical models to make this decision.

- Inflection module: Lemmas are turned into their corresponding inflected words
using the morphological information previously analyzed.

- Formatting module: Contraction and apostrophization are applied in order to
respect the Catalan orthographic rules.

- Integration module: Finally, translations are compiled according to the original text
format.

In the following section, examples are used in order to show the functionality of each 
module in a more concrete way. 

3 Modules 

3.1 Fragmenter Module 

Input text must be segmented into sentences so that translation can be carried out.
This module does so by means of rules.

Input: Input to this module is the Spanish text to be translated.

La estudiante atendió.

Output: Output from this module expresses the whole text in an XML format, in which
every paragraph, sentence, translation unit (ut) and uppercase character has been
detected.

<doc> <p> <o> <ut ort="M">la</ut> <ut>estudiante</ut> <ut>atendió</ut>

</o> </p> </doc>
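
To make the output format concrete, the following minimal sketch produces the XML above for the example sentence. The segmentation rules (blank lines separate paragraphs; '.', '!' or '?' followed by whitespace ends a sentence; punctuation is dropped) and the function name are assumptions made for illustration; the actual rules of the fragmenter are not described in this paper.

```python
import re
import xml.etree.ElementTree as ET


def fragment(text):
    """Wrap text into <doc>/<p>/<o>/<ut> elements (illustrative only)."""
    doc = ET.Element("doc")
    for paragraph in filter(None, (p.strip() for p in text.split("\n\n"))):
        p_el = ET.SubElement(doc, "p")
        # Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            o_el = ET.SubElement(p_el, "o")
            # Assumed rule: each run of word characters is one translation unit.
            for word in re.findall(r"\w+", sentence):
                ut = ET.SubElement(o_el, "ut")
                if word[0].isupper():
                    ut.set("ort", "M")  # remember the upper-case initial
                ut.text = word.lower()
    return ET.tostring(doc, encoding="unicode")


xml_output = fragment("La estudiante atendió.")
```

Unlike the real module, this sketch treats every word as a separate translation unit and does not detect compound expressions.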



3.2 Labeler Module 

This module outputs a graph that represents all the possible syntactic analyses of the
input sentence. The applied method consists of a full search of translation units (words
or compound expressions) through a finite-state network representing the dictionary.

Input: Input to this module consists of fragmented sentences.

<doc> <p> <o> <ut ort="M">la</ut> <ut>estudiante</ut> <ut>atendió</ut>

</o> </p> </doc>

Output: Output from this module is a finite-state transducer in which each edge
associates translation units and lexical categories² according to the dictionary. Note that
each translation unit, represented as a connection between states, can refer to
either a word or a compound expression, since the TAVAL dictionary stores lexical labels
for single words as well as for phrases.




[Figure: graph of candidate analyses for the example sentence; legible edge labels
include "NC s:la", "AO s:estudiante CS" and "VPI s:@atender S3 SD"]

Fig. 1. Labeler’s output

Fig. 1 shows all the possible PoS-tags for the example sentence, together with some
linguistic information. Specifically, the word la can be a pronoun, an article or a noun. The
word estudiante can be an adjective or a noun; it is a singular word (S) and its gender
depends on some issues that are handled in the Nominal phrase agreement module (C). The
PoS-tags for the word atendió are VPT and VPI, both corresponding to an inflected form³ of
the verb atender.

3.3 Syntactic Disambiguation Module 

Syntactic disambiguation aims to decide the lexical category that a word has in a context. 
To do this, both rule-based and corpus-based techniques are applied. 

Statistical disambiguation can be defined as a maximization problem. Let
$W = \{w_1, w_2, \ldots, w_n\}$ be the source language vocabulary and let
$C = \{c_1, c_2, \ldots, c_m\}$ be all the possible categories. Given an input sentence
$w = w_1, \ldots, w_L$, the process can be accomplished by searching for the category
sequence $\hat{c} = c_1, \ldots, c_L$ that maximizes:

\[ \hat{c} = \operatorname*{argmax}_{c \in C^L} P(c \mid w) \tag{1} \]

Using Bayes' rule, and given that the maximization is independent of the probability of
the input sentence $w$, equation (1) can be rewritten as:

\[ \hat{c} = \operatorname*{argmax}_{c \in C^L} P(c)\, P(w \mid c) \tag{2} \]

In this equation, contextual (or language model) probabilities, $P(c)$, represent all the
possible category sequences, whereas emission (or lexical model) probabilities,
$P(w \mid c)$, establish the relationship between words and categories.



² PA: Pronoun, DD: Article, NC: Noun, A2: Adjective, VPT: Transitive verb, VPI: Intransitive verb.
³ 3rd person singular (S3) from simple past (SD).

To solve this equation, certain Markov assumptions can be accepted to simplify
the problem. First, contextual probabilities for a given category are assumed to
depend only on the immediately previous n categories. The second constraint consists
of assuming that emission probabilities depend only on the category itself.

For first-order Markov models (bigrams), the problem reduces to solving the following
equation:

\[ \hat{c} = \operatorname*{argmax}_{c \in C^L} \prod_{i=1}^{L} P(c_i \mid c_{i-1})\, P(w_i \mid c_i) \tag{3} \]

The parameters of this equation can be represented as a Hidden Markov Model (HMM)
in which states and categories are associated one-to-one. Contextual probabilities,
$P(c_i \mid c_{i-1})$, are transition probabilities between states, and lexical model
probabilities, $P(w_i \mid c_i)$, can be seen as word-category probability distributions.
The Viterbi algorithm [14] has been used to find, for a given input sentence, its
associated category sequence.
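
The search in Eq. (3) can be illustrated with a small Viterbi sketch. The probability values below are made up for the example sentence, and the start symbol, the probability floor for unseen events and the dictionary-based data structures are assumptions made here; they are not the models estimated from the LexEsp corpus.

```python
import math


def viterbi(words, candidates, trans, emit, start="<s>"):
    """Most likely category sequence under a bigram HMM.

    candidates[w]: categories the dictionary allows for word w.
    trans[(prev, c)]: P(c | prev); emit[(w, c)]: P(w | c).
    """
    floor = 1e-12  # assumed floor for unseen events, to avoid log(0)
    # best[c] = (log-probability of the best path ending in c, that path)
    best = {start: (0.0, [])}
    for w in words:
        new_best = {}
        for c in candidates[w]:
            e = math.log(emit.get((w, c), floor))
            score, path = max(
                (s + math.log(trans.get((prev, c), floor)) + e, p)
                for prev, (s, p) in best.items()
            )
            new_best[c] = (score, path + [c])
        best = new_best
    return max(best.values())[1]


# Made-up toy model for "la estudiante atendió" (tags as in Fig. 1).
candidates = {"la": ["PA", "DD", "NC"], "estudiante": ["A2", "NC"],
              "atendió": ["VPT", "VPI"]}
trans = {("<s>", "DD"): 0.5, ("<s>", "PA"): 0.2, ("<s>", "NC"): 0.3,
         ("DD", "NC"): 0.8, ("DD", "A2"): 0.2, ("PA", "NC"): 0.1,
         ("PA", "A2"): 0.1, ("NC", "NC"): 0.1, ("NC", "A2"): 0.3,
         ("NC", "VPT"): 0.4, ("NC", "VPI"): 0.3,
         ("A2", "VPT"): 0.2, ("A2", "VPI"): 0.2}
emit = {("la", "DD"): 0.3, ("la", "PA"): 0.2, ("la", "NC"): 0.001,
        ("estudiante", "NC"): 0.01, ("estudiante", "A2"): 0.005,
        ("atendió", "VPT"): 0.02, ("atendió", "VPI"): 0.01}
tags = viterbi(["la", "estudiante", "atendió"], candidates, trans, emit)
```

Working in log space avoids numerical underflow on long sentences; restricting the search to the categories proposed by the dictionary mirrors the fact that this module receives the labeler's graph as input.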

Input: Input to this module is labeler’s output, which is represented in Fig. 1, and 
models all the possible syntactic analyses for the input sentence. 

Output: Output from this module is the linear graph corresponding to the most likely 
path through the input graph, according to the category-based models described 
before. 



[Figure: linear graph with edges "DD s:la FS", "NC s:estudiante CS" and
"VPT s:@atender S3 SD"]

Fig. 2. Syntactic disambiguation output



In practice, some rules are first applied to reduce ambiguity, and then the statistical
disambiguation model presented here is applied.

3.4 Nominal Phrase Agreement Module 

Because the training corpus for syntactic disambiguation does not include information
about word gender or number, a subsequent process is needed to make
all the words in each nominal phrase agree.

The method consists of locating nominal phrases inside a sentence by
means of a knowledge-based codification of nominal phrases in terms of category
sequences [15].

Once a nominal phrase has been located, the words inside it can be made to agree in
gender and number through the application of some hierarchical rules that depend
on the kind of phrase detected.

Input: Input to this module is the Syntactic disambiguation module’s output. As Fig. 2
shows, it consists of a linear graph containing PoS-tag labelling.

Output: Output from this module offers sentences in which gender and number
agreement has been made at the nominal phrase level. In Fig. 3, it is possible to see the
detection of a nominal phrase (la estudiante) and, as a result, the noun’s gender has been
made to agree with the article’s.




3.5 Localizer Module 

This module expands each ut into all its possible translations according to
the dictionary.

Input: Input to this module is agreement module’s output, where nominal phrase marks 
have been deleted. 



[Figure: linear graph with edges "DD s:la FS", "NC s:estudiante FS" and
"VPT s:@atender S3"]

Fig. 4. Nominal phrase agreement output without phrase marks



Output: Output from this module is a lemma graph including every possible translation
of the input graph, according to the dictionary.

[Fig. 5: lemma graph; the only legible edge label is "VPT c:@atendre S3 SD"]




3.6 Semantic Disambiguation Module 

The semantic disambiguation module tries to decide the right translation for a ut according
to the input sentence context. In this paper, only the most likely translation for each
dialectal variety is taken into account. Dictionary entries have their meanings manually
scored. Therefore, for each ut, the prototype chooses the best-scored sense in a user-given
dialectal variety. Corpus-based statistical models are planned for future
versions of this module.
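
The dictionary-based choice described above amounts to a filtered maximum. The sketch below is only an illustration: the entry layout, the field names, the variety labels and the scores are hypothetical and do not reproduce the TAVAL dictionary format.

```python
def choose_translation(entry, variety):
    """Pick the best-scored translation available in the given dialectal variety."""
    options = [t for t in entry if variety in t["varieties"]]
    return max(options, key=lambda t: t["score"])["lemma"]


# Hypothetical entry for the Spanish verb "salir", with made-up scores.
salir = [
    {"lemma": "sortir", "varieties": {"eastern"}, "score": 0.9},
    {"lemma": "eixir", "varieties": {"western"}, "score": 0.9},
    {"lemma": "marxar", "varieties": {"eastern", "western"}, "score": 0.4},
]
eastern_choice = choose_translation(salir, "eastern")
western_choice = choose_translation(salir, "western")
```

Note that the sentence context plays no role in this choice, which is why corpus-based models are planned for this module.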

Input: Input to this module is the localizer’s output. As Fig. 5 shows, every possible
translation of each ut from Fig. 4 is represented.

Output: Output from this module is a linear graph which corresponds to the best-scored
path through the input graph.

[Fig. 6: linear graph; the only legible edge label is "VPT c:@atendre S3 SD"]




3.7 Inflection Module 

This is a rule-based module which inflects words according to the Catalan
inflection model.

Input: Input to this module is the Semantic disambiguation module’s output, which is
shown in Fig. 6. It represents a sentence of Catalan lemmas to be inflected.

Output: Output from this module is the inflected input, that is, a sentence in which
lemmas have been turned into words according to some inflection rules.



[Figure: linear graph of inflected words; legible labels include "estudiant" and
"atengué"]

Fig. 7. Inflecter’s output



3.8 Formatting Module 

This module is also rule-based; it applies apostrophization and contraction
procedures according to Catalan grammar.

Input: Input to this module is the inflection module’s output, which can be seen in Fig. 7.

Output: This module finally offers sentences that are well written from an orthographic
point of view. In Fig. 8, it is possible to see the transformation of La estudiant into
L’estudiant as well as an alternative way of expressing past tenses, which tends to be
more usual.




[Figure: linear graph with edges "L’estudiant" and "va atendre"]

Fig. 8. Formatter’s output

3.9 Integration Module

This module turns finite-state linear graphs into sentences, according to the original text 
format. 

Input: Module’s input is formatter’s output, which Fig. 8 shows. 

Output: Output from this module (or final output) is displayed as a Catalan text, which 
is a translation for the Spanish input text. 

L’estudiant va atendre. 



4 Experiments 

4.1 Corpora 

In order to estimate statistically the different models used in the
implemented version of the prototype, several data corpora have been collected.

Specific tools were developed to look for information on the web. These tools
were very useful, especially for collecting the necessary corpora.

The LexEsp corpus [16], with nearly 90,000 running words, was used to estimate the
parameters of the syntactic disambiguation model. A label, from a set of approximately
70 categories, was manually assigned to each word.

Two other corpora (El Periódico de Cataluña and Diari Oficial de la Generalitat
Valenciana) were obtained by means of web tools. These corpora will be used in some
system improvements such as training models for semantic disambiguation. They
consist of parallel texts, aligned at sentence level, in a Spanish-to-Catalan translation
framework without semantic constraints.

In order to assess the system, a bilingual corpus was created. This corpus
is composed of 240 sentence pairs, extracted from different sources and published in
both languages. Of course, they are not included in any training corpus.

- 120 sentence pairs from El Periódico de Cataluña, with no semantic constraints.

- 50 pairs from Diari Oficial de la Generalitat Valenciana, an official publication of
the Valencian Community government.

- 50 pairs from technical software manuals.

- 20 pairs from websites (Valencia Polytechnical University, Valencia city council,
etc.).

4.2 Results 

Word error rate (WER⁴) is a translation quality measure that computes the edit
distance between translation hypotheses and a predefined reference translation. The edit
distance is the number of substitutions, insertions and deletions that are needed
to turn a translation hypothesis into the reference translation. The accumulated
number of errors for all the test sentences is then divided by the number of running words,
and the resulting percentage shows the average rate of incorrect words. Since it can
be automatically computed, it has become a very popular measure. WER results for
the SisHiTra system can be seen in Table 1.

⁴ Also known as Translation WER (TWER).

Table 1. WER comparison for some machine translation systems

System         WER
InterNOSTRUM   11.9
SisHiTra       10.1
SALT            9.9



A disadvantage of WER is that it only compares the translation hypothesis with a
fixed reference translation. This leaves no margin for other correct translations
expressed in a different writing style. So, to avoid this problem, we used WER with
multiple references (MWER⁵) for evaluating the prototype. MWER considers several
reference translations for the same test sentence, then computes the edit distance to all of
them, returning the minimum value as the error corresponding to that sentence. MWER
offers a more realistic measure than WER because it allows for more variability in
translation style. MWER results for the SisHiTra system are similar to those achieved
by commercial systems (InterNOSTRUM⁶ and SALT⁷), as can be seen in
Table 2.
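
Both measures can be sketched in a few lines. The sketch below is illustrative only and rests on assumptions not stated in the text: words are obtained by whitespace splitting, errors are normalized by the number of running words in the reference, and for MWER the length of the closest reference is used. WER is then simply MWER with a single reference per sentence.

```python
def edit_distance(hyp, ref):
    """Word-level edit distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,               # delete h
                            curr[j - 1] + 1,           # insert r
                            prev[j - 1] + (h != r)))   # substitute or match
        prev = curr
    return prev[-1]


def mwer(hypotheses, reference_sets):
    """Multi-reference WER (in %) over a list of test sentences."""
    errors = 0
    words = 0
    for hyp, refs in zip(hypotheses, reference_sets):
        tokens = hyp.split()
        # Keep the reference that is closest to the hypothesis.
        dist, length = min(
            (edit_distance(tokens, r.split()), len(r.split())) for r in refs
        )
        errors += dist
        words += length
    return 100.0 * errors / words


def wer(hypotheses, references):
    """Single-reference WER (in %)."""
    return mwer(hypotheses, [[r] for r in references])


single = wer(["l'estudiant va atendre"], ["l'estudiant atengué"])
multi = mwer(["l'estudiant va atendre"],
             [["l'estudiant atengué", "l'estudiant va atendre"]])
```

In this toy case the hypothesis is penalized under WER only because the single reference uses the simple past; adding the periphrastic form as a second reference removes the penalty.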



Table 2. MWER comparison for some machine translation systems

System         MWER
InterNOSTRUM   8.4
SisHiTra       6.8
SALT           6.5



5 Conclusions and Future Work 

In the framework of the SisHiTra project, a general-scope Spanish-to-Catalan translation
prototype has been developed. The translation process is based on finite-state machines
and statistical models, automatically inferred from parallel corpora. Translation results
are promising, considering that much work remains to be done.

We hope to improve results by correcting some mistakes accidentally
introduced in some of the hand-made knowledge sources (dictionary, grammatical rules,
etc.), as well as to advance the modular development of the prototype by including new
processes to increase translation quality.



⁵ Multi-reference Word Error Rate.

⁶ See http://www.torsimany.ua.es

⁷ See http://www.cult.gva.es/DGOIEPL/SALT/salt_programes_salt2.htm

The most relevant areas where the system could be improved are:

- Semantic disambiguation, where statistical models for ambiguous words could be
trained in order to choose the most appropriate context-dependent
translations.

- Gender and number agreement between verbal phrases.

- Disambiguation in some verb pairs such as ser and ir, creer and crear, etc., since they
have lexical forms in common.

Finally, a future goal for SisHiTra is its extension to other Romance languages
(Portuguese, French, Italian, etc.).

Abstract. This paper presents an algorithm to apply the smoothing techniques
described in [15] to three different Machine Learning (ML) methods for Word 
Sense Disambiguation (WSD). The method to obtain better estimations for the 
features is explained step by step, and applied to n-way ambiguities. The results 
obtained in the Senseval-2 framework show that the method can help improve the 
precision of some weak learners, and in combination attain the best results so far 
in this setting. 



1 Introduction 

Many current Natural Language Processing (NLP) systems rely on linguistic knowledge 
acquired from tagged text via Machine Learning (ML) methods. Statistical or alterna- 
tive models are learned, and then applied to running text. The main problem faced by 
such systems is the sparse data problem, due to the small amount of training examples. 
Focusing on Word Sense Disambiguation (WSD), only a handful of occurrences with 
sense tags are available per word. For example, if we take the word channel, we see 
that it occurs 5 times in SemCor [10], the only all-words sense-tagged corpus publicly 
available: the first sense has four occurrences, the second a single occurrence, and the 
other 5 senses are not represented. For a few words, more extensive training data exists. 
Senseval-2 [4] provides 145 occurrences of channel, but still some of the senses are 
represented by only 3 or 5 occurrences. 

It has to be noted that both in NLP and WSD, most events occur rarely, even
when large quantities of training data are available. Besides, fine-grained analysis of the
context requires that it be represented with many features, some of them rare, but which
can be very informative. Therefore, the estimation of rarely occurring features might be
crucial for high performance.

Smoothing is the technique that tries to estimate the probability distribution that
approximates the one we expect to find in held-out data. In WSD, if all occurrences of
a feature for a given word occur in the same sense, Maximum Likelihood Estimation
(MLE) would give a 0 probability to the other senses of the word given the feature,
which is a severe underestimation. We will denote these cases as X/0, where X is the
frequency of the majority sense, and zero is the frequency of the other senses.
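
The X/0 problem can be reproduced with a few lines of code. This is a minimal sketch of plain MLE only, not of the smoothing method; the sense labels and the number of senses are placeholders chosen for illustration.

```python
def mle_sense_probs(counts, senses):
    """P(sense | feature) by MLE from the sense counts observed with one feature."""
    total = sum(counts.get(s, 0) for s in senses)
    return {s: counts.get(s, 0) / total for s in senses}


# A 1/0 case: the feature was seen once, with a single sense (placeholder labels).
senses = ["art#1", "art#2", "art#3", "art#4"]
probs = mle_sense_probs({"art#2": 1}, senses)
```

A single observation is enough for MLE to assign probability 1 to the observed sense and 0 to all the others, which is the underestimation that smoothing has to correct.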

For instance, if the word Jerry occurs in the context of art only once in the training
data with a given sense, does it mean that the probability of other senses of art occurring
in the context of Jerry is 0? We will see in Section 4.3 that this is not the case, and that
the other senses are nearly as probable. Our smoothing study will show for this feature
of the word art that the smoothed ratio should be closer to 1/1.

J. L. Vicedo et al. (Eds.): EsTAL 2004, LNAI 3230, pp. 360-371, 2004.
© Springer-Verlag Berlin Heidelberg 2004

In this paper, we follow the smoothing method proposed by Yarowsky in his PhD 
dissertation [15], and present a detailed algorithm of its implementation for the WSD 
problem, defining some of the parameters used, alongside the account of its use by 
three different ML algorithms: Decision Lists (DL), Naive Bayes (NB), and Vector 
Space Model (VSM). The impact of several smoothing strategies is also presented, and 
the results indicate that the smoothing method explored in this paper is able to make 
both statistically motivated methods (DL and NB) perform at very high precisions, 
comparable and in some cases superior to the best results attained in the Senseval-2 
competition. We also show that a simple combination of the methods and a fourth system 
based on Support Vector Machines (SVM) attains the best result for the Senseval-2 
competition reported so far. 

An independent but related motivation for this work is the possibility to use smoothing 
techniques in bootstrapping approaches. Bootstrapping techniques such as [16] have 
shown that if we have good seeds, it could be possible to devise a method that could 
perform with quality similar to that of supervised systems. Smoothing techniques could 
help to detect rare but strong features which could be used as seeds for each of the target 
word senses. 

The paper is organized as follows. Section 2 presents the experimental setting. Sec- 
tion 3 introduces smoothing of feature types and Section 4 presents the detailed algorithm 
with examples. Section 5 presents the results and comparison with other systems, and, 
finally, the last section draws some conclusions. 



2 Experimental Setting 

In this section we describe the target task and corpus used for evaluation, the type of 
features that represent the context of the target word, and the ML algorithms applied to 
the task. 

2.1 Corpus 

The experiments have been performed using the Senseval-2 English Lexical-Sample data
[4]. This will allow us to compare our results with the systems in the competition and
with other recent works that have focused on this dataset. The corpus consists of 73 target
words (nouns, verbs, and adjectives), with 4,328 testing instances, and approximately
twice as many training instances. We used the training corpus with cross-validation to
estimate the C parameter for the SVM algorithm, and to obtain the smoothed frequencies for
the features (see below). For the set of experiments in the last section, the systems were
trained on the training part, and tested on the testing part.

A peculiarity of this hand-tagged corpus is that the examples for a given target word 
include multiword senses, phrasal verbs, and proper nouns. A separate preprocess is 
carried out in order to detect those cases with 96.7% recall. 

2.2 Features

The feature types can be grouped into three main sets:

Local Collocations: bigrams and trigrams formed with the words around the target.
These features are constituted by lemmas, word-forms, or PoS tags¹. Other local features
are those formed with the previous/posterior lemma/word-form in the context.

Syntactic Dependencies: syntactic dependencies were extracted using heuristic
patterns, and regular expressions defined with the PoS tags around the target². The following
relations were used: object, subject, noun-modifier, preposition, and sibling.

Bag-of-words Features: we extract the lemmas of the content words in the whole
context, and in a ±4-word window around the target. We also obtain salient bigrams in
the context, with the methods and the software described in [14].



2.3 ML Methods 

Given an occurrence of a word, the ML methods below return a weight for each sense
($weight(s_k)$). The sense with maximum weight will be selected. The occurrences are
represented by the features in the context ($f_i$).

The Decision List (DL) algorithm is described in [15]. In this algorithm the sense
$s_k$ with the highest weighted feature $f_i$ is selected, as shown below. In order to avoid 0
probabilities in the divisor, we can use smoothing or discard the feature altogether.

\[ weight(s_k) = \operatorname*{argmax}_{f_i} \log\left( \frac{P(s_k \mid f_i)}{\sum_{j \neq k} P(s_j \mid f_i)} \right) \tag{1} \]
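
As an illustration of Eq. (1), the sketch below computes the DL weight of each sense, reading the operator in the equation as the maximum log-ratio over the features present in the context. The data layout and the toy probabilities are assumptions made here; features whose ratio is undefined are discarded, which is one of the two options mentioned above.

```python
import math


def dl_weights(features, p_sense_given_feature, senses):
    """weight(s_k) = max over features of log(P(s_k|f) / sum_{j != k} P(s_j|f))."""
    weights = {}
    for s in senses:
        scores = []
        for f in features:
            probs = p_sense_given_feature.get(f)
            if probs is None:
                continue  # feature not seen in training
            rest = sum(p for t, p in probs.items() if t != s)
            if rest == 0 or probs.get(s, 0) == 0:
                continue  # assumed policy: discard features with a zero term
            scores.append(math.log(probs[s] / rest))
        weights[s] = max(scores, default=float("-inf"))
    return weights


# Toy (made-up) probabilities P(sense | feature) for two features.
table = {"f1": {"s1": 0.8, "s2": 0.2}, "f2": {"s1": 0.3, "s2": 0.7}}
weights = dl_weights(["f1", "f2"], table, ["s1", "s2"])
chosen = max(weights, key=weights.get)
```

Here the sense s1 wins because its single strongest feature (f1, ratio 4) outweighs the strongest feature of s2 (f2, ratio 7/3); the remaining features are ignored, which is what distinguishes DL from NB.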



The Naive Bayes (NB) method is based on the conditional probability of each sense
$s_k$ given the features $f_i$ in the context. It requires smoothing in order to prevent the
whole product from being zero because of a single feature.

\[ weight(s_k) = P(s_k) \prod_{i=1}^{m} P(f_i \mid s_k) \tag{2} \]
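
A minimal sketch of Eq. (2) follows. It works with the logarithm of the weight, which leaves the ranking of the senses unchanged. The constant floor used for unseen features is only a crude stand-in introduced for this example and is not the smoothing method studied in this paper; the toy probabilities are made up.

```python
import math


def nb_weights(features, prior, p_feature_given_sense, floor=1e-6):
    """Log of Eq. (2): log P(s_k) + sum_i log P(f_i | s_k)."""
    weights = {}
    for s, p_s in prior.items():
        log_w = math.log(p_s)
        for f in features:
            # A single zero here would make the whole product zero.
            log_w += math.log(p_feature_given_sense[s].get(f, floor))
        weights[s] = log_w
    return weights


# Toy (made-up) model with two senses and two features.
prior = {"s1": 0.6, "s2": 0.4}
likelihood = {"s1": {"f1": 0.5, "f2": 0.1}, "s2": {"f1": 0.2, "f2": 0.4}}
weights = nb_weights(["f1", "f2"], prior, likelihood)
chosen = max(weights, key=weights.get)
```

In contrast to the decision list, every feature contributes to the weight, so s2 is selected here even though s1 has the higher prior.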



For the Vector Space Model (VSM) method, we represent each occurrence context 
as a vector, where each feature will have a 1 or 0 value to indicate the occurrence/absence 
of the feature. For each sense in training, one centroid vector (C_{s_k}) is obtained. These 
centroids are compared with the vectors that represent testing examples (f), by means 
of the cosine similarity function. The closest centroid assigns its sense to the testing 
example. No smoothing is required to apply this algorithm, but it is possible to use 
smoothed values instead of 1s and 0s. 

weight(s_k) = \cos(C_{s_k}, f) = \frac{C_{s_k} \cdot f}{|C_{s_k}| \, |f|}    (3) 
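The VSM classification step can be sketched as follows, using sparse dictionaries as vectors. The helper names are hypothetical, and averaging is only one possible way of building the centroid; because cosine similarity ignores vector length, summing the example vectors would rank the senses identically. 

```python
import math
from collections import defaultdict

def centroid(examples):
    """Average a list of feature sets (binary vectors) into one sparse vector."""
    c = defaultdict(float)
    for feats in examples:
        for f in feats:
            c[f] += 1.0 / len(examples)
    return dict(c)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def vsm_classify(test_feats, training):
    """Return the sense whose centroid is closest to the test example."""
    test_vec = {f: 1.0 for f in test_feats}
    cents = {s: centroid(exs) for s, exs in training.items()}
    return max(cents, key=lambda s: cosine(cents[s], test_vec))

training = {"a": [{"x", "y"}, {"x"}], "b": [{"z"}, {"z", "y"}]}
predicted = vsm_classify({"x"}, training)
```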

Regarding Support Vector Machines (SVM), we utilized SVM-Light, a public distribution 
of SVM by [8]. We estimated the soft margin (C) using a greedy process in 
cross-validation on the training data. The weight for each sense is given by the distance 
to the hyperplane that supports the classes, that is, the sense s_k versus the rest of senses. 



^1 The PoS tagging was performed with the fnTBL toolkit [13]. 

^2 This software was kindly provided by David Yarowsky’s group, from the Johns Hopkins 
University. 

3 Feature-Type Smoothing 

We have already seen in the introduction that estimating X/0 features with MLE would 
yield a probability P(s|f) = 1 for the majority sense and a probability P(s|f) = 0 for 
the minority senses, which is an underestimation. Features with X/0 counts are usual 
when the training data is sparse, and these values must be smoothed before they are fed 
to some learning algorithms, such as DL or NB, as they lead to undetermined values in 
their formulations. 

Other distributions, such as X/1, X/2, ... can also be estimated using smoothing tech- 
niques. [15] argues that the probability of the second majority sense in X/1 distributions 
would be overestimated by MLE. For intermediate cases, such as X/2, X/3, etc. it is not 
clear whether the effort of modeling would be worth pursuing. For higher frequencies, 
using the raw frequency could be good enough. In this work we focused on X/0 and X/1 
distributions. 

The smoothing algorithm shown here (which we will call feature-type smoothing) 
follows the ideas of [15]. The main criteria to partition the training data have been 
raw frequencies and feature types (e.g. prev_N_wf, the feature type that represents the first 
noun word-form to the left of the target). Raw frequency is the most important parameter 
when estimating the distribution, and joining features of the same type is a conservative 
approach to partition the data. Therefore we join all occurrences of the prev_N_wf feature 
type that have the same frequency distribution for the target word, e.g. 1/0. This way, 
we perform smoothing separately for each word. 

We could use the smoothed values calculated in this manner directly, but many data 
points would still be missing. For instance, when studying prev_N_wf in the X/0 frequency 
case for art, we found occurrences of this feature type in held-out data in the 1/0, 2/0 and 
3/0 cases, but not the rest (4/0 and higher). In this case it is necessary to use interpolation 
for the missing data points, and we applied log-linear interpolation. The interpolation 
also offers additional benefits. Firstly, using the slope of the interpolated line we can 
detect anomalous data (such as cases where 1/0 gets higher smoothed values than 5/0) as 
we always expect a positive slope, that is, higher ratios deserve higher smoothed values. 
Secondly, interpolation can be used to override a minority of data points which contradict 
the general trend. These points will be illustrated in the examples presented in Section 4.3. 

However, when using interpolation, we need at least two or three data points for 
all feature types. For feature types with few points, we apply a back-off strategy: we 
join the available data for all words in the same Part of Speech. The rationale for this 
grouping is that strong features for a noun should be also strong for other nouns. In order 
to decide whether we have enough data for a feature type or not, we use the number of 
data points (minimum of three) available for interpolation. In order to check the validity 
of the interpolation, those cases where we get negative slope are discarded. 



4 Feature-Type Smoothing Algorithm 

There are two steps in the application of the smoothing algorithm to the disambiguation 
task. First, we use the available training data in cross-validation, with an interpolation 
method, in order to estimate the smoothing tables for each feature type with X/0 or X/1 
raw frequency. Second, the interpolated tables are accessed in the disambiguation phase, 
when the WSD methods require them. Sections 4.1 and 4.2 present the algorithms, and 
Section 4.3 shows some illustrative examples. 

Table 1. Smoothing table for the feature prev_N_wf and the word art (X/0 distribution) 

 Original |     Held-out      |          Accumulated          |            Interpolated 
  X   Y   |  X'   Y'   X'/Y'  |  X'   Y'   X'/Y'  log(X'/Y')  |  X''  Y''   X''/Y''  log(X''/Y'') 
  1   0   |  4    4    1      |  4    4    1.00   0.00        |  1    0.91  1.10     0.09 
  2   0   |  6    1    6      |  10   5    2.00   0.69        |  2    1.18  1.69     0.52 
  3   0   |  2    0    ∞      |  12   5    2.4    0.88        |  3    1.14  2.63     0.96 
  -   -   |  -    -    -      |  -    -    -      -           |  4    0.98  4.08     1.40 

4.1 Building Smoothing Tables 

We build two kinds of smoothing tables. The first kind is the application of the grouping 
strategy based on feature types and frequency distributions. Two tables are produced; 
one at the word level, and another at the PoS level, which we will call smoothed tables. 
The second kind is the result of the interpolation method over the two aforementioned 
tables, which we will call interpolated tables. All in all, four tables are produced in two 
steps for each frequency distribution (X/0 and X/1). 

1) Construct Smoothing Tables for Each Target Word and for Each PoS. For each 
feature type (e.g. prev_N_wf), we identify the instances that have X/0 or X/1 distributions 
(e.g. prev_N_wf Aboriginal) and we count collectively their occurrences per sense. We 
obtain tables with (X',Y') values for each word, feature type and pair (X,Y), where 
(X,Y) indicate the values seen for each feature in the training part, and (X',Y') represent 
the counts for all the instances of the feature type with the same (X,Y) distribution in 
the held-out part. 

We perform this step using 5-fold cross-validation on the training data. We separate 
the training data in a stratified way^3 into two parts: estimation-fold (4/5 of the data) and 
target-fold (1/5 of the data), which plays the role of the held-out data. We run the algorithm five 
times in turn, until each part has been used as target. The algorithm is described in detail 
in Figure 1 for the X/0 case (the X/1 case is similar). Note that the X count corresponds 
to the majority sense for the feature, and the Y count to all the rest of minority senses 
for the feature. For example, we can see in the held-out columns in Table 1 the (X',Y') 
counts obtained for the feature type prev_N_wf and the target word art in the Senseval-2 
training data for the X/0 cases. 

2) Create Interpolation Curves. From the smoothing tables, we interpolate curves for 
feature types that have at least 3 points. The process is described in detail in the second 
part of Figure 1. We first accumulate the counts in the smoothed table from the previous 
step. The “Accumulated” columns in Table 1 show these values, as well as the X/Y ratio 
and its logarithm. The Y value is then normalized, and mapped into the logarithmic space. 
We apply a common linear interpolation algorithm called the least squares method [11], 
which yields the starting point and slope for each interpolation table. If we get a negative 
slope, we discard this interpolation result. Otherwise, we can apply it to any X, and after 
mapping again into the original space we get the interpolated values of Y, which we denote 
Y''. Table 1 shows the Y'' values, the X''/Y'' ratios, and the log values we finally obtain 
for the prev_N_wf example for art for X = 1..4 and Y = 0 (“Interpolated” columns). 
The X''/Y'' ratios indicate that for X values lower than 4, the feature type is not reliable, 
but for X >= 4 and Y = 0, this feature type can be used with high confidence for art. 

^3 By stratified, we mean that we try to keep the same proportion of word senses in each of the 5 
folds. 

1. Construct word smoothing tables for X/0 (X0) 
   - For each fold from training-data (5 folds) 
       Build count(f, w, sense) for all senses from the estimation-folds (4 folds) 
       For each word w, for each feature f in each occurrence in target-fold (1 fold) 
         get count'(f, w, sense) for all senses of w in target-fold 
         If distribution of count'(f, w, sense) is of kind X/0 (X0) then 
           For each sense 
             if sense = argmax_s count(f, w, s) then   # sense is major sense in estimation-fold 
               increment X' in table_word_X0(w, type(f), X) 
             else 
               increment Y' in table_word_X0(w, type(f), X) 
   - Normalize all tables: X' is set to X, and Y' := Y'X/X' 
     Output (no need to keep X'): normtable_word_X0(w, type(f), X) := Y' 

2. Log-linear interpolation 
   - Accumulate X' and Y' values 
   - Map into linear space: 
       logtable_word_X0(w, type(f), X) := 
         log(acctable_word_X0(w, type(f), X).X' / acctable_word_X0(w, type(f), X).Y') 
   - Do linear interpolation of logtable: sourcepoint(w, type(f)) := a_0, 
       gradient(w, type(f)) := a_1 
   - For each X from 1 to ∞ 
       interpolatedtable_word_X0(w, type(f), X) := X / e^(a_0 + a_1 X) 

Fig. 1. Construction of smoothing tables for X/0 features for words. The X/1 and PoS tables are 
built similarly 
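The log-linear interpolation step can be illustrated with a short sketch. It fits a least-squares line through the logarithms of the accumulated X'/Y' ratios of Table 1 and maps the fitted line back to Y'' values. The function names are assumptions made for this example, and the printed values should only be expected to approximate the “Interpolated” Y'' column of Table 1, since the rounding used in the original computation is not known. 

```python
import math

def fit_log_linear(points):
    """Least-squares line through (X, log(ratio)) points; returns (a0, a1)."""
    xs = [x for x, _ in points]
    ys = [math.log(ratio) for _, ratio in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    a0 = mean_y - a1 * mean_x
    return a0, a1

def interpolated_y(x, a0, a1):
    """Invert log(X / Y'') = a0 + a1 * X, giving Y'' = X / e^(a0 + a1 * X)."""
    return x / math.exp(a0 + a1 * x)

# Accumulated X'/Y' ratios for prev_N_wf and "art", taken from Table 1.
accumulated = [(1, 4 / 4), (2, 10 / 5), (3, 12 / 5)]
a0, a1 = fit_log_linear(accumulated)

if a1 > 0:  # a negative slope would mean the interpolation is discarded
    for x in range(1, 5):
        print(x, round(interpolated_y(x, a0, a1), 2))
```

Note that the fitted line can be evaluated at X = 4, for which no held-out data point exists; this is the extrapolation that the text relies on. 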

4.2 Using the Smoothed Values 

The process to use the smoothed values in testing is described in Figure 2. There 
we see that when we find X/0 or X/1 distributions, the algorithm resorts to the 
obtain_smoothed_value function to access the smoothing tables. The four tables constructed 
in the previous section are all partial, i.e. in some cases there is no data available for some 
of the senses. The tables are consulted in a fixed order: we first check the interpolated 
table for the target word; if it is not available for the feature type, we access the 
interpolated table for the PoS of the target word. Otherwise, we resort to the non-interpolated 
smoothing table at the word level. Finally we access the non-interpolated smoothing 
table for the PoS. 
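The fixed consultation order can be sketched as a simple chain of lookups. The dictionary layout and the table names below are assumptions made for this illustration; only the order of the four tables comes from the text. 

```python
def obtain_smoothed_value(word, pos, ftype, x, y, tables):
    """Return Y' from the first table that has an entry, or None if all four miss.

    `tables` is a hypothetical dict of four sub-dicts, each keyed by
    (word_or_pos, feature_type, X, Y).
    """
    lookup_order = [
        ("interpolated_word", word),  # 1. interpolated table for the target word
        ("interpolated_pos", pos),    # 2. interpolated table for its PoS
        ("smoothed_word", word),      # 3. non-interpolated table at word level
        ("smoothed_pos", pos),        # 4. non-interpolated table at PoS level
    ]
    for table_name, key in lookup_order:
        value = tables.get(table_name, {}).get((key, ftype, x, y))
        if value is not None:
            return value
    return None  # no smoothing data: the caller discards or backs off

tables = {
    "interpolated_word": {},
    "interpolated_pos": {("N", "prev_N_wf", 1, 0): 0.8},
    "smoothed_word": {("art", "prev_N_wf", 1, 0): 1.0},
}
value = obtain_smoothed_value("art", "N", "prev_N_wf", 1, 0, tables)
```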

Given an occurrence of a word w in testing, for each feature f in the context: 
  Get count(f, w, sense) for all senses from all training (all 5 folds) 
  If counts are not X/1 or X/0 then 
    For each sense: 
      count'(f, w, sense) := count(f, w, sense) 
  Elseif count is X/Y (where Y is 1 or 0) then 
    If Y' := obtain_smoothed_value(X, Y) then 
      For each sense 
        If sense = argmax_s count(f, w, s) then       # (MAJOR SENSE) 
          count'(f, w, sense) := X 
        Elsif sense = 2nd_sense then                  # (ONLY IF Y=1, WHERE A MINORITY SENSE OCCURS ONCE) 
          count'(f, w, sense) := Y'                   # (SECOND SENSE GETS MORE CREDIT) 
        Else 
          count'(f, w, sense) := Y' / |other_senses|  # (DISTRIBUTE WEIGHT UNIFORMLY AMONG MINOR SENSES) 
    Else                                              # (THERE IS NO SMOOTHING DATA FOR THIS X/Y) 
      DISCARD                                         # (THIS IS POSSIBLE FOR DL) 
      For each sense 
        If sense = argmax_s count(f, w, s) then       # (MAJOR SENSE) 
          count'(f, w, sense) := X 
        Elsif sense = 2nd_sense then                  # (ONLY IF Y=1, WHERE A MINORITY SENSE OCCURS ONCE) 
          count'(f, w, sense) := 1                    # (SECOND SENSE GETS MORE CREDIT) 

Fig. 2. Application of Feature-type smoothing to DL, NB and VSM 

In cases where the four tables fail to provide information, we can benefit from additional 
smoothing techniques. The three ML methods that we have applied have different 
smoothing requirements, and one of them (NB) does need a generally applicable smoothing 
technique: 

DL: as it only uses the strongest piece of evidence, it can discard X/0 features. It does 
not require X/1 smoothing either. 

NB: It needs to estimate all single probabilities, i.e. all features for all senses, therefore 
it needs smoothing in X/0, X/1 and even X/2 and larger values of Y. The reason is that in 
the case of polysemy degrees larger than 2, the rare senses might not occur for the target 
feature and could lead to infinite values in Equation (2). 

VSM: it has no requirement for smoothing. 

In order to check the impact of the various smoothing possibilities we have devised 
6 smoothing algorithms to be applied with the 3 ML methods (DL, NB, and VSM). 
We want to note that not requiring smoothing does not mean that the method does not 
profit from the smoothing technique (as we shall see in the evaluation). For the baseline 
smoothing strategy we chose both “no smoothing” and “fixed smoothing”; we also tried 
a simple but competitive method from [12], denoted as “Ng smoothing” (methods to be 
described below). The other three possibilities consist of applying the Feature-Type 
method as in Figure 2, with two variants: use “Ng smoothing” for back-off (E), or in a 
combined fashion (F): 

(A) No smoothing: Use raw frequencies directly. 

(B) Fixed smoothing: Assign 0.1 raw frequency to each sense with a 0 value. 

(Ng) Ng smoothing: This method is based on the global distribution of the senses in the 
training data. For each feature, each of the senses of the target word that has no 
occurrences in the training data gets the ratio between the probability of the sense occurring in 
the training data and the total number of examples: Prob(sense)/Number_of_examples. 

(Ft) Feature-type smoothing: The method described in this paper. In the case of DL, note 
that when no data is available the feature is just discarded. For NB, it is necessary to rely 
on back-off strategies (see E and F). 

(E) Ft with Ng as back-off: When Ft does not provide smoothed values, Ng is applied. 

(F) Ft and Ng combined: The smoothed values are obtained by multiplying Ft and Ng 
values. Thus, in Figure 2, the count'(f, w, sense) values are multiplied by 
Prob(sense)/Number_of_examples. 

The output of the smoothing algorithm is the list of counts that replace the original 
frequency counts when computing the probabilities. We tested all possible combinations, 
but notice that not all smoothing techniques can be used with all the methods (e.g. we 
cannot use NB with “no smoothing”). 

Table 2. Smoothed values (interpolation per word) for the feature types prev_N_wf, 
win_cont_lem_context and win_2gram_context with the target word art 

        |  prev_N_wf            |  win_cont_lem_context    |  win_2gram_context 
 X  Y   |  X'  Y'   X''  Y''    |  X'    Y'     X''  Y''   |  X'   Y'    X''  Y'' 
 1  0   |  4   4    1    0.91   |  517   1187   1    2.24  |  63   150   1    2.31 
 2  0   |  6   1    2    1.18   |  82    125    2    4.45  |  8    4     2    4.37 
 3  0   |  2   0    3    1.14   |  13    22     3    6.62  |  2    1     3    6.48 
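The two simplest strategies, (B) and (Ng), can be sketched in a few lines. The function names and the per-feature `counts` layout are assumptions made for this illustration; the values assigned to unseen senses follow the descriptions given for (B) and (Ng). 

```python
def fixed_smoothing(counts, value=0.1):
    """(B): assign `value` to every sense whose raw count for the feature is 0."""
    return {s: (c if c > 0 else value) for s, c in counts.items()}

def ng_smoothing(counts, sense_totals):
    """(Ng): an unseen sense gets Prob(sense) / Number_of_examples.

    sense_totals[s] is the number of training examples tagged with sense s
    (a hypothetical input used to estimate Prob(sense)).
    """
    n_examples = sum(sense_totals.values())
    smoothed = {}
    for s, c in counts.items():
        if c > 0:
            smoothed[s] = c
        else:
            smoothed[s] = (sense_totals[s] / n_examples) / n_examples
    return smoothed

counts = {"a": 3, "b": 0}          # one feature seen 3 times with "a", never with "b"
sense_totals = {"a": 6, "b": 4}    # 10 training examples in total
fixed = fixed_smoothing(counts)
ng = ng_smoothing(counts, sense_totals)
```

With (Ng) the value given to an unseen sense shrinks as the training set grows, whereas (B) stays constant. 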

4.3 Application of Smoothing: An Example 

We will focus on three feature types and the target word art in order to show how 
the smoothed values are computed. For art, the following features have a 1/0 distribution 
in the training data: “prev_N_wf Aboriginal”, “win_cont_lem_context Jerry”, and 
“win_2gram_context collection owned”^4. The majority sense for the three cases is the 
first sense. If we find one of those features in a test occurrence of art, we would like to 
know whether they are good indicators of the first sense or not. 

As all these features occur with frequency 1/0, we have collected all counts for the 
feature types (e.g. prev_N_wf) which also have 1/0 occurrences in the training data. Table 1 
shows the counts for prev_N_wf; the (4,4) values that appear for (X',Y') indicate that the 
prev_N_wf features that have 1/0 distribution in the target-folds contribute 4 examples to 
the majority sense and 4 to the minority senses when looked up in the estimation-folds. 

The data for prev-N-wf has at least 3 points, and therefore we use the accumulated 
frequencies to obtain an interpolation table. We see that the interpolated frequencies for 
the minority senses stay nearly constant when the X values go up. This would reflect 
that the probability of the minority senses would go down quickly for higher values of 
X. In fact, the interpolated table can be used for values of X greater than 3, which had 
not been attested in the training data. 



^4 The first feature indicates that Aboriginal was the first noun to the left of art. The second that 
Jerry was found in the context window. The third that the bigram collection owned was found 
in the context window. 

Fig. 3. Interpolation curves for the X/0 case (features prev_N_wf and win_context) with the target 
word art. The Y'' estimation and the log(X''/Y'') values are given for each X value and feature 

The same process is followed for the other two feature types: win_cont_lem_context 
and win_2gram_context. Table 2 shows the smoothed values (X',Y') and the interpolated 
values (X'',Y'') for the three types studied. The values for Y are much higher in the latter 
two cases, indicating that there is a very low confidence for these features for the word art. 
In contrast, prev_N_wf can be a valuable feature if found in 4/0 or greater distributions. 

Figure 3 shows this different behavior graphically for win_cont_lem_context and 
prev_N_wf. For each feature type, the estimated Y'' values and the log-ratio of the 
majority sense are given: the higher the Y'' the lower the confidence in the majority sense, 
and inversely for the log-ratio. We can see that the Y'' values assigned 
to prev_N_wf get lower credit as X increases, and the log-ratio grows constantly. On 
the contrary, for win_cont_lem_context the values of Y'' increase, and the log-ratio 
remains below zero, indicating that this feature type is not informative. 



5 Results 

The main experiment is aimed at studying the performance of four ML methods with the 
different smoothing approaches (where applicable). The recall achieved on the Senseval-2 
dataset is shown in Table 3, with the best results per method marked in bold. We separated the 
results according to the type of smoothing: basic smoothing (“no smoothing” and “fixed 
smoothing”), and complex smoothing (techniques that rely on “Feature-type smoothing” 
and “Ng smoothing”). We can see that the results are different depending on the ML 
method, but the best results are achieved with complex smoothing for the 3 ML methods 
studied: DL (Ft and E), NB (F), and VSM (Ng). The best performance is attained by the 
VSM method, reaching 66.2%, which is one of the highest reported in this dataset. The 
other methods get more profit from the smoothing techniques, but their performance is 
clearly lower. McNemar’s test^5 shows that the difference between the results of the best 
“basic smoothing” technique and the best “complex smoothing” technique is significant 
for DL and NB, but not for VSM. 

All in all, we see that the performance of the statistically-based (DL, NB) methods 
improves significantly, making them comparable to the best single methods. In the next 
experiment, we tested a simple way to combine the output of the 4 systems: one system, 
one vote. The combination was tested on 2 types of systems: those that relied on “complex 
smoothing”, and those that did not. For each algorithm, the best smoothing technique for 
each type was chosen; e.g. the VSM algorithm would use the (A) approach for “simple 
smoothing”, and (Ng) for “complex smoothing” (see Table 3). The performance of these 
systems is given in Table 4. The table also shows the results achieved discarding one 
system in turn. 

^5 McNemar’s significance test has been applied with a 95% confidence interval. 

Table 3. ML methods and smoothing techniques: (A) no smoothing, (B) fixed smoothing, (Ng) 
Ng smoothing, (Ft) Feature-type smoothing, the method presented in this paper, (E) Ft with Ng 
as back-off, and (F) the combination of Ft and Ng. An asterisk marks the best result per method 
(bold in the original) 

         |  Basic Smoothing  |        Complex Smoothing 
 Method  |   A        B      |   Ng       Ft       E        F 
 DL      |   -       60.4    |  60.7     64.4*    64.4*    64.3 
 NB      |   -       62.9    |  63.5      -       61.8     63.8* 
 VSM     |  65.9     65.6    |  66.2*    64.0     64.2     65.2 
 SVM     |  65.8      -      |   -        -        -        - 

Table 4. Combination of systems with basic smoothing and complex smoothing. The rows show 
the recall achieved combining the 4 systems, and discarding one in turn 

 Systems       Basic smoothing   Complex smoothing 
 All methods   65.7              66.2 
 except SVM    64.9              66.2 
 except NB     66.0              66.7 
 except VSM    64.9              65.7 
 except DL     65.7              66.3 
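The “one system, one vote” combination can be sketched as a plain majority vote. The tie-breaking rule below (prefer the prediction of the system listed first) is an assumption made for this example, since the text does not say how ties are resolved. 

```python
from collections import Counter

def combine_by_vote(predictions):
    """Return the sense with the most votes.

    `predictions` holds one predicted sense per system. Ties go to the
    system that appears earliest in the list (an assumed rule).
    """
    votes = Counter(predictions)
    top = max(votes.values())
    for sense in predictions:
        if votes[sense] == top:
            return sense

# Hypothetical outputs of DL, NB, VSM and SVM for one test occurrence.
winner = combine_by_vote(["s1", "s2", "s2", "s3"])
```

Discarding one system in turn, as in Table 4, amounts to calling the function with that system's prediction left out. 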

The results show that we get an improvement over the best system (VSM) of 0.5% 
when combining it with DL and SVM. The table also illustrates that smoothing accounts 
for all the improvement, as the combination of methods with simple smoothing only 
reaches 66.0% in the best case, versus 66.7% for “complex smoothing” (difference 
statistically significant according to McNemar’s test with 95% confidence interval). 

As a reference, Table 5 shows the results reported for different groups and algorithms 
in the Senseval-2 competition and in more recent works. Our algorithms are identified 
by the “IXA” letters. “JHU - S2” corresponds to the Johns Hopkins University system 
in Senseval-2, which was the best performing system. “JHU” indicates the systems 
from the Johns Hopkins University implemented after Senseval-2 [3, 5]. Finally, “NUS” 
(National University of Singapore) stands for the systems presented in [9]. The table is 
sorted by recall. 

We can see that our systems achieve high performance, and that the combination of 
systems is able to beat the best results. However, we chose the best smoothing algorithm 
for the methods using the testing data (instead of using cross-validation on training, 
which would require constructing the smoothing tables for each fold). This fact makes 
the combined system not directly comparable. In any case, it seems clear that the system 
benefits from smoothing, and obtains results similar to the best figures reported to date. 

Table 5. Comparison with the best systems in the Senseval-2 competition and the recent literature 

 Method        Group    Smoothing        Recall   Note 
 Combination   IXA      Complex (best)   66.7     Best result to date 
 Combination   JHU      -                66.5     2nd best result to date 
 VSM           IXA      Ng               66.2 
 Combination   IXA      Basic (best)     66.0 
 SVM           IXA      -                65.8 
 SVM           NUS      -                65.4 
 DL            IXA      Ft               64.4 
 Combination   JHU-S2   -                64.2     Senseval-2 winner 
 NB            IXA      F                63.8 
 NB            NUS      “Add one”        62.7 


6 Conclusions 

In this work, we have studied the smoothing method proposed in [15], and we present a 
detailed algorithm for its application to WSD. We have described the parameters used, 
and we have applied the method on three different ML algorithms: Decision Lists (DL), 
Naive Bayes (NB), and Vector Space Model (VSM). We also analyzed the impact of 
several smoothing strategies. The results indicate that the smoothing method explored in 
this paper is able to make all three methods perform at very high precisions, comparable 
and in some cases superior to the best result attained in the Senseval-2 competition, 
which was a combination of several systems. We also show that a simple combination 
of the methods and a fourth system based on Support Vector Machines (SVM) attains 
the best result for the Senseval-2 competition reported so far (although only in its more 
successful configuration, as the system was not “frozen” using cross-validation). At 
present, this architecture has also been applied in the Senseval-3 competition, with good 
results, only 0.6% below the best system for English [1]. 

For the future, we would like to extend this work to X/Y features for Y greater than 1, 
and try other grouping criteria, e.g. taking into account the class of the word. We would 
also like to compare our results to other more general smoothing techniques [6, 7, 2]. 

Finally, we would like to apply the smoothing results to detect good features for 
bootstrapping, even in the case of low amounts of training data (as it is the case for most 
of the words in WSD). The DL method, which improves significantly with smoothing, 
may be well suited for this task, as it relies on one single piece of evidence (feature) to 
choose the correct sense. 

Abstract. Search engines have become the primary means of accessing information 
on the Web. However, recent studies show misspelled words are very common 
in queries to these systems. When users misspell a query, the results are incorrect 
or provide inconclusive information. In this work, we discuss the integration of 
a spelling correction component into tumba!, our community Web search engine. 
We present an algorithm that attempts to select the best choice among all possible 
corrections for a misspelled term, and discuss its implementation based on a 
ternary search tree data structure. 



1 Introduction 

Millions of people use the Web to obtain needed information, with search engines 
currently answering tens of millions of queries every day. However, with the increasing 
popularity of these tools, spelling errors are also increasingly frequent. Between 
10 and 12 percent of all query terms entered into Web search engines are misspelled [10]. 
A large number of Web pages also contain misspelled words. Web search is thus a task of 
information retrieval in an environment of faulty texts and queries. Even with misspelled 
terms in the queries, search engines often retrieve several matching documents - those 
containing spelling errors themselves. However, the best and most “authoritative” pages 
are often missed, as they are likely to contain only the correctly spelled forms. An 
interactive spelling facility that informs users of possible misspellings and presents appropriate 
corrections to their queries could bring improvements in terms of precision, recall, and 
user effort. Google was the first major search engine to offer this facility [8]. 

One of the key requirements imposed by the Web environment on a spelling checker is 
that it should be capable of selecting the best choice among all possible corrections for a 
misspelled word, instead of giving a list of choices as in word processor spelling checking 
tools. Users of Web search systems already give little attention to query formulation, 
and we feel that overloading them with an interactive correction mechanism would not 
be well accepted. It is therefore important to make the right choice among all possible 
corrections autonomously. 

This work presents the development of a spelling correction component for tumba!, 
our community search engine for the Portuguese Web [29]. In tumba! we check the query 
for misspelled terms while results are being retrieved. If errors are detected, we provide 
a suggestive link to a new “possibly correct” query, together with the search results for 
the original one. 



J. L. Vicedo et al. (Eds.): EsTAL 2004, LNAI 3230, pp. 372-383, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 

The rest of this paper is organized as follows: the next section presents the terminology 
used throughout this work. Section 3 gives an overview of previous approaches to 
spelling correction. Section 4 presents the ternary search tree data structure, used in 
our system for storing the dictionary. Section 5 details our algorithm and the heuristics 
behind it. Section 6 describes the data sources used to build the dictionary. Section 7 
describes experimental results. Finally, Section 8 presents our conclusions and directions 
for future work. 

2 Terminology 

Information Retrieval (IR) is concerned with the problem of providing relevant documents 
in response to a user’s query [2]. The most commonly used IR tools are Web search 
engines, which have become a fact of life for most Internet users. Search engines use 
software robots to survey the Web, retrieving and indexing HTML documents. Queries 
are checked against the keyword indexes, and the best matches are returned. 

Precision and Recall are the most popular metrics in evaluating IR systems. Precision 
is the percentage of retrieved documents that the searcher is actually interested in. Recall, 
on the other hand, is the percentage of all relevant documents that are actually retrieved, 
this way referring to how much information is retrieved by the search. The 
ultimate goal of an information retrieval system is to achieve high recall with high precision. 
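As a small worked example of the two metrics, the sketch below computes them from a set of retrieved document identifiers and a set of relevant ones. The function name and the document identifiers are made up for this illustration. 

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant (0.0 for empty sets)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents returned, three relevant documents exist, two of them were found.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Here half of the retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall 2/3). 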

Spelling has always been an issue in computer-based text tools. Two main problems 
can be identified in this context: Error detection, which is the process of finding 
misspelled words, and Error correction, which is the process of suggesting correct words 
for a misspelled one. Although other approaches exist, most spelling checking tools are 
based on a dictionary which contains a set of words which are considered to be correct. 
The problem of spelling correction can be defined abstractly as follows: Given an 
alphabet Σ, a dictionary D consisting of strings in Σ* and a string s, where s ∉ D and 
s ∈ Σ*, find the word w ∈ D that is most likely to have been erroneously input as s. 

Spelling errors can be divided into two broad categories: typographic errors, which 
occur because the typist accidentally presses the wrong key, presses two keys, presses the 
keys in the wrong order, etc; and phonetic errors, where the misspelling is pronounced 
the same as the intended word but the spelling is wrong. Phonetic errors are harder to cor- 
rect because they distort the word more than a single insertion, deletion or substitution. In 
this case, we want to be able to key in something that sounds like the misspelled word (a 
“phonetic code”) and perform a “fuzzy” search for close matches. The search for candi- 
date correct forms can be done at typographic level, and then refined using this method. 

3 Related Work 

Web information retrieval systems have been around for quite some time now, having 
become the primary means of accessing information on the Web [1,8]. Early search 
engines did not check query spelling, but since April 2001 several Web-wide search engines, 
including Excite and Google, have provided dynamic spelling checking, while others, such as 
Yahoo, simply tracked common misspellings of frequent queries, such as movie star 
names. Technical details for these systems are unavailable, but they seem to be based on 
spelling algorithms and statistical frequencies. 

Algorithmic techniques for detecting and correcting spelling errors in text also have a 
long and robust history in computer science [20]. Previous studies have also addressed 
the use of spelling correctors in the context of user interfaces [13]. Spelling checkers 
(sometimes called “spell checkers” by people who need syntax checkers) are nowadays 
common tools for many languages, and many proposals can also be found in the 
literature. Proposed methods include edit distance [11,31,21], rule-based techniques [32], 
n-grams [25, 33], probabilistic techniques [18], neural nets [28, 6, 17], similarity key 
techniques [34,23], or combinations [16,22]. All of these methods can be thought of as 
calculating a distance between the misspelled word and each word in the dictionary. The 
shorter the distance, the higher the dictionary word is ranked as a good correction. 



Fig. 1. The four most common spelling errors: transposition, wrong letter, extra letter, and 
missing letter 

Edit distance is a simple technique. The distance between two words is the number
of editing operations required to transform one into the other. Analyses of errors, mainly
typing errors, in very large text files have found that the great majority of wrong spellings
(80-95%) differ from the correct spellings in just one of the four ways described in
Figure 1. The editing operations to consider should therefore correspond to these four
errors, and candidate corrections include the words that differ from the original in a
minimum number of editing operations [11]. Recent work has experimented with
modeling more powerful edit operations, allowing generic string-to-string edits [7].
Additional heuristics are also typically used to complement techniques based on edit
distance. For instance, in the case of typographic errors, the keyboard layout is very
important: it is much more common to accidentally substitute one key for another if they
are placed near each other on the keyboard.
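
As an illustration of an edit distance built on the four error types of Figure 1, the sketch below computes the restricted Damerau-Levenshtein distance (also known as optimal string alignment). It is only a minimal example: the function name is ours, every operation is assumed to cost 1, and it is not the implementation used by any of the cited systems.

```python
def edit_distance(source: str, target: str) -> int:
    """Restricted Damerau-Levenshtein distance (optimal string alignment).

    Counts insertions, deletions, substitutions and transpositions of two
    adjacent characters, each with an assumed cost of 1.
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of the source prefix
    for j in range(cols):
        dist[0][j] = j  # insert every character of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # missing letter in target (deletion)
                dist[i][j - 1] + 1,         # extra letter in target (insertion)
                dist[i - 1][j - 1] + cost,  # wrong letter (substitution)
            )
            if (i > 1 and j > 1
                    and source[i - 1] == target[j - 2]
                    and source[i - 2] == target[j - 1]):
                # two adjacent letters exchanged (transposition)
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]
```

Under this definition, each of the four error types in Figure 1 yields a distance of exactly 1 from the intended word, which is why candidates at distance 1 are examined first.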

Similarity key methods are based on transforming words into similarity keys that
reflect their characteristics. The words in the dictionary and the words to test are both
transformed into similarity keys. All words in the dictionary sharing the same key with
a word being tested are candidates to return as corrections. An example of this method
is the popular Soundex system. Soundex (the name stands for “Indexing on sound”) was
devised to help with the problem of phonetic errors [12, 19]. It takes an English word
and produces a four-character code (a letter followed by three digits), in a rough-and-ready
way designed to preserve the salient features of the phonetic pronunciation of the word.
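
To make the idea of a similarity key concrete, the following sketch implements the commonly described American Soundex rules: keep the first letter, map the remaining consonants to digit classes, skip vowels, collapse adjacent letters of the same class, and pad with zeros. The function name and the handling of non-ASCII input (simply ignored here) are our assumptions; the variants described in [12, 19] may differ in detail.

```python
# Digit classes of the commonly described American Soundex.
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _letter in _letters:
        _CODES[_letter] = _digit


def soundex(word: str) -> str:
    """Return the four-character Soundex key of an English word."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit:
            if digit != previous:  # adjacent letters of one class count once
                code += digit
            previous = digit
        elif ch not in "hw":
            previous = ""          # a vowel separates repeated classes
        if len(code) == 4:
            break
    return code.ljust(4, "0")
```

For example, `soundex("Robert")` and `soundex("Rupert")` both give `R163`, so either word would be retrieved as a candidate for the other.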

The metaphone algorithm is also a system for transforming words into codes based
on phonetic properties [23, 24]. However, unlike Soundex, which operates on a
letter-by-letter scheme, metaphone analyzes both single consonants and groups of letters called
diphthongs, according to a set of rules for grouping consonants, and then mapping
groups to metaphone codes. The disadvantage of this algorithm is that it is specific
to the English language. A version of these rules for the Portuguese language has, to
the best of our knowledge, not yet been proposed. Still, there has been recent research
on machine learning methods for letter-to-phoneme conversion [15, 30]. Application of
these techniques to Portuguese should be straightforward, provided we have enough
training data.

More recent studies on error correction propose the use of context, attempting to
detect words which are misused but spelled correctly [5, 14]. Spelling checkers based
on isolated word methods would see the following sentence as correct: a paragraph cud
half mini flaws but wood bee past by the spill checker. However, since search engine
users rarely type more than three terms in a query, context-dependent correction would
bring little benefit. Isolated word methods should prove sufficient for our task.

4 Ternary Search Trees 

In this work, we use a ternary search tree (TST) data structure for storing the dictionary in
memory. TSTs are a type of trie that is limited to three children per node [3, 4]. Trie is the
common definition for a tree storing strings, in which there is one node for every common
prefix and the strings are stored in extra leaf nodes. TSTs have been successfully used for
several years in searching dictionaries. Search times in this structure are O(log(n) + k)
in the worst case, where n is the number of strings in the tree and k is the length
of the string being searched for. In a detailed analysis of various implementations of
trie structures, the authors concluded that “Ternary Search Tries are an effective data
structure from the information theoretic point of view since a search costs typically about
log(n) comparisons on real life textual data. [...] This justifies using ternary search tries
as a method of choice for managing textual data” [9].

Figure 2 illustrates a TST. The structure stores key-value pairs, where keys are the 
words and values are integers corresponding to the word frequency. As we can see, each 
node of the tree stores one letter and has three children. A search compares the current 
character in the search string with the character at the node. If the search character comes 
lexically first, the search goes to the left child; if the search character comes after, the 
search goes to the right child. When the search character is equal, the search goes to the 
middle child, and proceeds to the next character in the search string. 









Fig. 2. A ternary search tree storing the words “to”, “too”, “toot”, “tab” and “so”, all with an
associated frequency of 1

TSTs combine the time efficiency of tries with the space efficiency of binary search
trees. They are faster than hashing for many typical search problems, and support a broad
range of useful operations, like finding all keys having a given prefix, suffix, or infix, or
finding those keys that closely match a given pattern.
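
The search procedure described above can be sketched as follows. This is a minimal illustration and not the code used in our system: the class and method names are hypothetical, and, for brevity, the frequency is stored on the node holding the last character of a word rather than in a separate leaf node.

```python
class Node:
    """One TST node: a character, three children and an optional value."""
    __slots__ = ("char", "left", "middle", "right", "value")

    def __init__(self, char):
        self.char = char
        self.left = self.middle = self.right = None
        self.value = None  # frequency count if a word ends at this node


class TernarySearchTree:
    def __init__(self):
        self.root = None

    def insert(self, word, value=1):
        if not word:
            raise ValueError("empty keys are not supported")
        self.root = self._insert(self.root, word, 0, value)

    def _insert(self, node, word, index, value):
        char = word[index]
        if node is None:
            node = Node(char)
        if char < node.char:
            node.left = self._insert(node.left, word, index, value)
        elif char > node.char:
            node.right = self._insert(node.right, word, index, value)
        elif index + 1 < len(word):
            node.middle = self._insert(node.middle, word, index + 1, value)
        else:
            node.value = value
        return node

    def get(self, word):
        """Return the value stored for word, or None if it is absent."""
        node, index = self.root, 0
        while node is not None and word:
            char = word[index]
            if char < node.char:
                node = node.left       # search character comes first
            elif char > node.char:
                node = node.right      # search character comes after
            elif index + 1 < len(word):
                node = node.middle     # equal: advance to next character
                index += 1
            else:
                return node.value
        return None
```

Inserting the five words of Figure 2 with frequency 1 and then calling `get("too")` follows the middle links through “t” and “o” before reaching the second “o”, where the frequency is found.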

5 Spelling Correction Algorithm 

A TST data structure stores the dictionary. For each stored word, we also keep a
frequency count, originally obtained from the analysis of large corpora. To choose among
possible corrections for a misspelled word, we use these word frequency counts as a
popularity ranking, together with other information such as metaphone keys. Although
we do not have a specific text-to-phoneme algorithm for the Portuguese language, using
the standard metaphone algorithm yields good results in practice.

Queries entered in the search engine are parsed and the individual terms are extracted,
with non-word tokens ignored. Each word is then converted to lower case, and checked to
see if it is correctly spelled. Correctly spelled words found in user queries are updated in
the dictionary, by incrementing their frequency count. This way, we use the information
in the queries as feedback to the system, and the spelling checker can adapt to the patterns
in users’ searches by adjusting its behavior. For the misspelled words, a correctly spelled
form is generated. Finally, a new query is presented to the user as a suggestion, together
with the results page for the original query. By clicking on the suggestion, the user can
reformulate the query.

Our system integrates a wide range of heuristics and the algorithm used for making 
the suggestions for each misspelled word is divided in two phases. In the first, we generate 
a set of candidate suggestions. In the second, we select the best. 

The first phase of the algorithm can be further decomposed into 9 steps. In each step, 
we look up the dictionary for words that relate to the original misspelling, under specific 
conditions: 

1. Differ in one character from the original word.

2. Differ in two characters from the original word.

3. Differ in one letter removed or added.

4. Differ in one letter removed or added, plus one letter different.

5. Differ in repeated characters removed.

6. Correspond to 2 concatenated words (space between words eliminated).

7. Differ in having two consecutive letters exchanged and one character different.

8. Have the original word as a prefix.

9. Differ in repeated characters removed and 1 character different.

In each step, we also move on directly to the second phase of the algorithm if one 
or more matching words are found (i.e., if there are candidate correct forms that only 
differ in one character from the original misspelled word, a correct form that differs in 
more characters and is therefore more complex will never be chosen). 
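
The staged lookup with early exit can be sketched as below. This is a simplified illustration under several assumptions: only steps 1, 3, 5, 6 and 8 are shown, the dictionary is a plain Python set rather than a TST, and the alphabet and all function names are ours.

```python
import re

# Assumed alphabet: ASCII letters plus common Portuguese accented letters.
ALPHABET = "abcdefghijklmnopqrstuvwxyzáàâãçéêíóôõú"


def substitutions(word):
    """Step 1: all strings differing from word in exactly one character."""
    return {word[:i] + c + word[i + 1:]
            for i in range(len(word)) for c in ALPHABET if c != word[i]}


def insertions_deletions(word):
    """Step 3: all strings with one letter removed or added."""
    deleted = {word[:i] + word[i + 1:] for i in range(len(word))}
    inserted = {word[:i] + c + word[i:]
                for i in range(len(word) + 1) for c in ALPHABET}
    return deleted | inserted


def collapse_repeats(word):
    """Step 5: the word with runs of a repeated character reduced to one."""
    return {re.sub(r"(.)\1+", r"\1", word)}


def split_in_two(word, dictionary):
    """Step 6: splits of word into two dictionary words."""
    return {word[:i] + " " + word[i:] for i in range(1, len(word))
            if word[:i] in dictionary and word[i:] in dictionary}


def candidates(word, dictionary):
    """Run the steps in order and stop at the first one that finds matches."""
    steps = [
        lambda: substitutions(word) & dictionary,
        lambda: insertions_deletions(word) & dictionary,
        lambda: collapse_repeats(word) & dictionary,
        lambda: split_in_two(word, dictionary),
        lambda: {w for w in dictionary if w.startswith(word) and w != word},
    ]
    for step in steps:
        found = step()
        if found:
            return found  # simpler corrections win: later steps are skipped
    return set()
```

Because the loop returns as soon as one step succeeds, a candidate that needs a more complex transformation is never considered when a simpler one exists.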

In the second phase, we start with a list of possible corrections. We then try to select
the best one, following these heuristics:

1. If there is one solution that differs only in accented characters, we automatically
return it. Typing words without correct accents is a very common mistake in the
Portuguese language (20% according to Medeiros [22]).

2. If there is one solution that differs only in one character, with the error corresponding
to an adjacent letter in the same row of the keyboard (the QWERTY layout is
assumed), we automatically return it.

3. If there are solutions that have the same metaphone key as the original string, we
return the smallest one, that is, the one with fewer characters.

4. If there is one solution that differs only in one character, with the error corresponding
to an adjacent letter in an adjacent row of the keyboard, we automatically return it.

5. In the last case, we return the smallest word.

We follow the list of heuristics sequentially, and only move to the next if no matching
words are found. If more than one word satisfies the conditions of a heuristic, we first
prefer the words whose first character matches that of the original word. If there is still
more than one word, we return the one that has the highest frequency count.
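
A partial sketch of this selection phase is given below. It covers only heuristics 1, 2 and 5 and the tie-breaking rule; the metaphone and adjacent-row heuristics are omitted. The function names, the simplified QWERTY rows and the reading of the tie-breaking rule (matching the first character of the original word) are assumptions made for the example.

```python
import unicodedata

# Simplified QWERTY letter rows (assumption: no punctuation or digit keys).
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")


def strip_accents(word):
    """Remove diacritics, e.g. 'lampião' -> 'lampiao'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_row_neighbours(a, b):
    for row in QWERTY_ROWS:
        if a in row and b in row:
            return abs(row.index(a) - row.index(b)) == 1
    return False


def one_adjacent_key_apart(word, candidate):
    """True if the words differ in one character typed on a neighbouring key."""
    if len(word) != len(candidate):
        return False
    diffs = [(a, b) for a, b in zip(word, candidate) if a != b]
    return len(diffs) == 1 and same_row_neighbours(*diffs[0])


def break_ties(word, matches, frequency):
    """Prefer candidates sharing the first character, then the most frequent."""
    same_initial = [c for c in matches if c[:1] == word[:1]]
    pool = same_initial or matches
    return max(pool, key=lambda c: frequency.get(c, 0))


def select_best(word, candidates, frequency):
    candidates = list(candidates)
    if not candidates:
        return None
    heuristics = [
        # Heuristic 1: differs only in accented characters.
        lambda c: c != word and strip_accents(c) == strip_accents(word),
        # Heuristic 2: one character off, adjacent key in the same row.
        lambda c: one_adjacent_key_apart(word, c),
    ]
    for matches_heuristic in heuristics:
        matches = [c for c in candidates if matches_heuristic(c)]
        if matches:
            return break_ties(word, matches, frequency)
    # Heuristic 5: fall back to the smallest word.
    shortest = min(len(c) for c in candidates)
    return break_ties(word, [c for c in candidates if len(c) == shortest],
                      frequency)
```

Note that frequency only decides between candidates that already satisfy the same heuristic, so a rare word can still win over a frequent one when it matches an earlier heuristic.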

6 Data Sources and Associated Problems 

The dictionary for the spelling checking system is a normal text file, where each line
contains a term and its associated frequency. The sources of Portuguese words and word
frequencies for the dictionary were the texts from the Natura-Publico and the
Natura-Minho corpora [27, 26]. The first one is made of the first two paragraphs of news articles
from Publico in the years of 1991, 1992, 1993 and 1994. The second corresponds to the
full articles in 86 days of editions of the newspaper Diario do Minho, spread across the
years of 1998 and 1999.

The dictionary strongly affects the quality of a spelling checking system. If it is too 
small, not only will the candidate list for misspellings be severely limited, but the user 
will also be frustrated by too many false rejections of words that are correct. On the 
other hand, a lexicon that is too large may not detect misspellings when they occur, due 
to the dense “word space”. 

News articles capture the majority of the words commonly used, as well as technical
terms, proper names, common foreign words, or references to entities. However, such
large corpora often contain many spelling errors [26]. We use word frequencies to choose
among possible corrections, which to some extent should deal with this problem. As
misspelled terms are, in principle, less frequent over the corpus than their corresponding
correct form, only on rare occasions should the spelling checker provide an erroneous
suggestion.

The Web environment introduces difficulties. It is general in subject, as opposed to
domain specific, and multilingualism issues are also common. While spelling checkers
in text editors use standard and personal dictionaries, search engine spelling checkers
should be more closely tied to the content they index, providing suggestions based on
the content of the corpus. This would avoid the dead-end effect of suggesting a word that
is correctly spelled but not included in any words on the site, and add access to names
and codes which will not be in any dictionary. However, using a search engine’s inverted
index as the basis of the spelling dictionary only works well when the content has been
copy edited, or when an editor is available to check the word list and reject misspellings.

7 Evaluation Experiments 

Some experiments were performed in order to quantitatively evaluate our spelling cor- 
rection mechanism. 

We were first interested in evaluating the quality of the proposed suggestions. To
achieve this, we compared the suggestions produced by our spelling checker against
Aspell - see the project homepage at http://aspell.sourceforge.net/. Aspell
is a popular interactive spelling checking program for Unix environments. Its strength
comes from merging the metaphone algorithm with a near miss strategy, this way
correcting phonetic errors and making better suggestions for seriously misspelled words.
The algorithm behind Aspell is therefore quite similar to the one used in our work, and
the quality of the results in both systems should be similar.

We used a hand-compiled list of 120 common misspellings, obtained from
CiberDuvidas da Lingua Portuguesa (http://ciberduvidas.sapo.pt/php/glossario.php)
and by inspecting the query logs for the search engine. The table
below shows the list of misspelled terms used, the correctly spelled word, and the
suggestions produced. In the table, a “*” means that the algorithm did not detect the
misspelling and a “-” means the algorithm failed in returning a suggestion.



Table 1. Misspelled terms, correct forms, and the suggestions produced by each system

Correct Form      Spelling Error    Our Algorithm     ASpell
ameixial          ameixeal          ameixial          ameixial
artifice          artifece          artifice          artifice
camoniano         camoneano         camoniano         camoniano
definido          defenido          definido          defendo
lampiao           lampeao           lampiao           lampiao
oficina           ofecina           oficina           oficina
acerca            acerca            acerca            acerca
açoriano          açoreano          açoriano          coreano
alcoolemia        alcoolemia        *                 *
antepor           antepor           *                 antepor
arctico           artico            artigo            aortico
antarctico        antartico         catartico         antarctico
bainha            baínha            bainha            bainha
bebe              bebe              bebe              bebe
bege              beje              *                 beije
bençao            bençao            *                 *
beneficencia      beneficiencia     beneficencia      beneficencia
biopsia           biopsia           *                 *
burburinho        borborinho        burburinho        burburinho
caiem             caem              *                 *
calvicie          calvice           calvicie          calvicie
campeao           campiao           campeao           campeao
comboio           comboio           comboio           comboio
compor            compor            *                 compor
comummente        comumente         comovente         comummente
constituia        constituia        *                 *
constituiu        constituiu        constituiu        constituiu
cor               cor               *                 *
crânio            craneo            cranio            carneo
despretensioso    despretencioso    despretensioso    despretensioso
pretensioso       pretencioso       pretensioso       pretensioso
definiçao         defeniçao         definiçao         definiçao
definir           defenir           definir           definir
desiquilibrio     desequilibrio     desequilibrio     desequilibrio
dispender         despender         *                 *
dignatarios       dignitarios       dignatarios       digitarias
dispendio         dispendio         *                 dispendio
ecra              ecran             *                 ecran
emirados          emiratos          estratos          meritos
esquisito         esquesito         esquisito         esquisito
estratego         estratega         *                 *
feminino          femenino          feminino          feminino
feminismo         femininismo       -                 feminismo
for               for               *                 *
gineceu           geneceu           gineceu           gineceu
gorjeta           gorgeta           gorjeta           gorjeta
granjear          grangear          granjear          granjear
guisar            guizar            guisar            gizar
hectare           hectar            *                 *
halariedade       hilaridade        *                 *
Hiroshima         hiroxima          aproxima          proxima
ilacçao           elaçao            ilaçao            ilaçao
indispensavel     indespensavel     indispensavel     indispensavel
inflacçao         inflaçao          *                 *
interveio         interviu          intervir          inter viu
intervindo        intervido         *                 *
invocar           evocar            *                 *
ipsilon           ipslon            ipsilon           ipsilon
irisar            irizar            irisar            razar
irupçao           irrupçao          *                 *
esoterico         isoterico         *                 *
jeropiga          geropiga          -                 georgia
juiz              juíz              *                 juiz
leem              lem               les               lema
linguista         linguista         *                 linguista
lisonjear         lisongear         lisonjear         lisonjear
logotipo          logotipo          logo tipo         logo tipo
saem              saiem             saiam             saem
saloiice          saloice           baloice           saloiice
sarjeta           sargeta           sarjeta           sarjeta
semear            semiar            semear            semear
suíça             suiça             suiça             suiça
supor             supor             *                 supor
rainha            raínha            rainha            rainha
raiz              raiz              *                 raiz
raul              raul              raul              raul
redea             redia             redea             radia
regurgitar        regurjitar        regurgitar        regurgitar
rejeitar          regeitar          rejeitar          regatar
requeiro          requero           requere           requeiro
restia            restea            restia            resta
rectaguarda       retaguarda        *                 *
rubrica           rubrica           *                 *
quadricromia      quadricomia       -                 quadriculai
quadruplicado     quadriplicado     quadruplicado     quadruplicado
quasimodo         quasimodo         -                 quisido
quilo             kilo              *                 Nilo
quilograma        kilograma         holograma         holograma
quilometro        kilometro         milimetro         milimetro
quis              quiz              quis              qui
paralisar         paralizar         paralisar         paralisar
perserverança     preseverança      perseverança      perseverança
persuasao         persuaçao         persuasao         persuasao
persuasao         presuasao         persuasao         persuasao
pirineus          pireneus          *                 *
privilegio        previlegio        privilegio        privilegio
Oceania           Oceania           *                 *
oprobrio          oprobio           aerobio           proibo
organograma       organigrama       *                 *
nonagesimo        nonagessimo       nonagesimo        nonagesimo
maciço            massiço           massico           massico
majestade         magestade         majestade         majestade
manjerico         mangerico         manjerico         manjerico
manjerona         mangerona         tangerina         tangerina
meteorologia      metereologia      meteorologia      meteorologia
miscigenaçao      miscegenaçao      miscigenaçao      miscigenaçao
transfuga         transfuga         transfira         transfira
transpor          transpor          *                 *
urano             úrano             *                 *
ventoinha         ventoinha         ventoinha         ventoinha
verosimil         verosimel         *                 *
vigilante         vegilante         vigilante         vigilante
voo               voo               *                 *
vultuoso          vultoso           *                 *
xadrez            xadres            xadrez            ladres
xama              chama             chama             chama
xelindro          xilindro          cilindro          cilindro
chiita            xiita             *                 xiitas
zangao            zangao            *                 *
zepelin           Zeppelin          -                 zeplim
zoo               zoo               zoo               coo

48.33% of the correct forms were correctly guessed, and our algorithm outperformed
Aspell by a slight margin of 1.66%. On the 120 misspellings, our algorithm failed to
detect a spelling error 38 times, and it failed to provide a suggestion only 5 times.
Note that the data source used to build the dictionary itself has spelling errors. A careful
process of reviewing the dictionary could improve results in the future.

Kukich points out that most researchers report accuracy levels above 90% when the
first three candidates are considered instead of the first guess [20]. Guessing the one
right suggestion to present to the user is much harder than simply identifying misspelled
words and presenting a list of possible corrections.

In the second experiment, we took some measures from the integration of our spelling
checker with a search engine for the Portuguese Web. We tried to see if, by using the
spelling correction component, there were improvements in terms of precision and recall
in our system. Using a hand-compiled list of misspelled queries, we measured the number
of retrieved documents for the original query, and the number of retrieved documents for
the transformed query. We also had a human evaluator assess the quality of the first
ten results returned by the search engine, that is, measure how many documents in the
first ten results were relevant to the query.

Results confirm our initial hypothesis that integrating spelling correction in Web
search tools can bring substantial improvements. Although many pages were returned
in response to misspelled queries (and in some cases all pages were indeed relevant),
the results for the correctly spelled queries were always of better quality and more
relevant.

Table 2. Results from the Integration of the Spelling Checker With Tumba!

Misspelled Query    # Relevant Results    Correct Query    # Relevant Results
camoneano           5                     camoniano        10
açoreano            10                    açoriano         10
calvice             3                     calvicie         10
campiao             9                     campeao          10
femenino            9                     feminino         10
guizar              6                     guisar           10
rainha              10                    rainha           10
regurjitar          0                     regurgitar       10
magestade           9                     majestade        10
mangerico           9                     manjerico        10
metereologia        10                    meteorologia     10
vegilante           0                     vigilante        10
xadres              9                     xadrez           10
zoo                 0                     zoo              10



8 Conclusions and Future Work 

This paper presented the integration of a spelling correction component into tumba!, a
Portuguese community Web search engine. The key challenge in this work was
determining how to pick the most appropriate spelling correction for a mistyped query from
a number of possible candidates.

The spelling checker uses a ternary search tree data structure for storing the
dictionary. As source data, we used a large textual corpus from two popular
Portuguese newspapers. The evaluation showed that our system gives results of acceptable
quality, and that integrating spelling correction in Web search tools can be beneficial.
However, the validation work could be improved with more test data to support our
claims.

An important area for future work concerns phonetic error correction. We would like
to experiment with machine learning text-to-phoneme techniques that could adapt to
the Portuguese language, instead of using the standard metaphone algorithm [15,30].
We also find that queries in our search engine often contain company names, acronyms,
foreign words and names, etc. Having a dictionary that can account for all these cases
is very hard, and large dictionaries may result in an inability to detect misspellings due to
the dense “word space”. However, keeping two separate dictionaries, one in the TST
used for correction and another in a hash table used only for checking valid words,
could yield interesting results. Studying ways of using the corpus of Web pages and the
logs from our system as the basis for the spelling checker is also a strong objective for
future work. Since our system imports dictionaries in the form of ASCII word lists, we
do however have an infrastructure that facilitates lexicon management.

Abstract. This report presents a statistical study of WPT-03, a text corpus built
from the pages of the “Portuguese Web” collected in the repository of the tumba!
search engine. We give a statistical analysis of the textual contents available in the
Portuguese Web, including size distributions, the language of the pages, and the
terms they contain.



1 Introduction 

This study provides a statistical analysis of the textual contents on the Web page repos- 
itory of the tumba! search engine [15]. More specifically, the source of information is 
the text extracted from a collection of documents from the “Portuguese Web”, during 
the first semester of 2003. This roughly comprises all the pages hosted under the .PT 
top level domain (TLD), and other pages written in Portuguese and hosted in other 
TLDs (excluding .BR because most of these pages are also written in the Portuguese 
language). 

The information presented in this study is of interest for the characterization of the 
textual contents of the Portuguese Web, as well as for future work within the scope of 
project tumba!. It is complemented by another report which provides statistics on the 
structure of the Portuguese Web [6]. 

The textual corpus is named WPT-03 and it is distributed by Linguateca (a resource
center for the processing of the Portuguese language - http://www.linguateca.pt)
to researchers in the area of Natural Language Processing (NLP).
For more information about the availability of WPT-03, see the corresponding Web page
at http://xldb.fc.ul.pt/linguateca/WPT_03.html.

The rest of this report is organized as follows: The next Section describes the WPT- 
03 corpus. In Section 3, we give statistics of the data in the corpus of web documents. 
Finally, Section 4 presents some conclusions. 

2 Contents of the WPT-03 Corpus 

The source of information for our study is a corpus of Web pages retrieved by the crawler 
of the tumba! search engine [5]. This snapshot of the Portuguese Web includes, for the 
most part, documents of types HTML and PDF, hosted in the .PT domain or written in 
Portuguese and hosted in the .COM, .NET, .ORG, or .TV domains. 



J. L. Vicedo et al. (Eds.): EsTAL 2004, LNAI 3230, pp. 384-394, 2004.
© Springer-Verlag Berlin Heidelberg 2004

The data was harvested and processed using the components from the XMLBase
Web database software, which includes the crawler, a Web content analyzer and a
repository - see the project Web page at http://xldb.di.fc.ul.pt/index.php?page=XMLBase.




Fig. 1. Overview of the XMLBase Framework 

WebCAT is the tool responsible for parsing and analyzing the Web contents [8]. 
Among other things, it performs document format conversion, text extraction, and meta- 
data extraction. Both the original documents and the corresponding “textual versions” 
are maintained in Versus, a data repository for Web information [3]. All statistics are 
based on the corpus formed by the text documents stored in Versus. 

The repository also contains meta-information about the documents, including for 
example the size, storage date, and language properties. Since there is no way of knowing 
the language in which the documents extracted from the Web were written, an automatic 
tool to perform this task had to be developed. This language “guessing” component 
is based on a well-known n-gram analysis algorithm [4], together with heuristics for 
handling Dublin Core meta-data (which may or may not be available in the documents). In
a controlled study, the algorithm presented a precision of about 91% in discriminating 
among 11 different languages [7]. 
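
As a rough illustration of n-gram based language guessing, the sketch below compares rank-ordered character n-gram profiles using an “out-of-place” distance. We assume here that this is representative of the kind of algorithm referred to in [4]; the function names, the profile size, the penalty for unseen n-grams and the toy training samples are all hypothetical, and the Dublin Core heuristics are not shown.

```python
from collections import Counter


def ngram_profile(text, max_n=3, size=300):
    """Map the most frequent character n-grams of text to their rank."""
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by decreasing count; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get a maximum penalty."""
    penalty = len(lang_profile)
    return sum(abs(rank - lang_profile[gram]) if gram in lang_profile
               else penalty
               for gram, rank in doc_profile.items())


def guess_language(text, training):
    """Return the language whose profile is closest to that of text."""
    doc = ngram_profile(text)
    profiles = {lang: ngram_profile(sample)
                for lang, sample in training.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

In practice the per-language profiles would be built once from large training samples; with very short texts such a measure is unreliable, which is one reason for combining it with meta-data heuristics.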

A problem we faced concerns files in the PDF format - although most of the
documents can be converted into plain text, the conversion tool sometimes fails in extracting
the text, producing garbage as output instead of terminating with an error. Filtering these
situations can be very hard. We currently exclude most of these faulty documents using
a simple filter, which looks at the first characters of the file. However, this is not a perfect
solution and many “garbage” documents are still included in the corpus.

Many of the presented statistics count “terms”. We adopted a definition of “term”
similar to that given by the Berkeley elib project - see the corresponding Web page
at http://elib.cs.berkeley.edu/docfreq. According to it, terms are
sequences of the following characters:

- a-z, A-Z, 0-9

- ASCII 150-160, 170, 181, 186, 192-214, 215-246, and 248-255 (mostly accented Latin letters).

All other characters are regarded as term breaks. We differ from this definition in the
way we handle hyphens. A hyphen is considered a valid character of a term when the next
character is one of a-z or A-Z, to account for the fact that hyphens are essential characters
in Portuguese, whereas in English they are mere punctuation marks (note that we still
consider them punctuation marks if they are not immediately followed by an alphabetic
character). The definition of “term” adopted in this study therefore includes all sequences
of the following characters:

- a-z, A-Z, 0-9;

- ASCII 45 (= “-”), when immediately followed by one of a-z or A-Z;

- ASCII 150-160, 170, 181, 186, 192-214, 215-246 and 248-255.
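
The term definition above can be sketched as a small tokenizer. This is only an illustration: the function name is ours, and we assume that the character codes refer to ISO-8859-1 (Latin-1), whose code points coincide with the first 256 Unicode code points used by Python's `chr`.

```python
import string

# Extended character codes from the definition (assumed to be Latin-1).
_EXTENDED_CODES = (set(range(150, 161)) | {170, 181, 186}
                   | set(range(192, 215)) | set(range(215, 247))
                   | set(range(248, 256)))
_EXTENDED = {chr(code) for code in _EXTENDED_CODES}
_LETTERS = set(string.ascii_letters)
_ALNUM = _LETTERS | set(string.digits)


def tokenize(text):
    """Split text into terms; every other character is a term break."""
    terms, current = [], []
    for i, ch in enumerate(text):
        following = text[i + 1] if i + 1 < len(text) else ""
        # A hyphen belongs to a term only when a-z or A-Z comes right after it.
        is_term_char = (ch in _ALNUM or ch in _EXTENDED
                        or (ch == "-" and following in _LETTERS))
        if is_term_char:
            current.append(ch)
        elif current:
            terms.append("".join(current))
            current = []
    if current:
        terms.append("".join(current))
    return terms
```

For instance, `tokenize("guarda-chuva é útil - sim")` keeps “guarda-chuva” as a single term, while the isolated hyphen is treated as punctuation.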

3 Statistics of the WPT-03 Corpus 

3.1 Document Statistics 

The Portuguese Web snapshot analyzed in this study has 3775611 documents, collected
between the 21st of March 2003 and the 26th of June 2003. Of these documents, about
68.6% (2590641 documents) are written in Portuguese.

Table 1 shows the average, median, and standard deviation of document sizes for 
WPT-03. Document size is measured in real size, text size and number of terms. Real 
size and text size are given in bytes, measuring the size of the document in the original 
format (HTML, PDF, etc.) and converted into plain text, respectively. 



Table 1. Document size statistics

                      Real size    Text size    Number of terms
Average               24461        2886         438
Median                14672        1336         188
Standard deviation    54191        8240         1327



Figure 2 shows the distribution of document sizes measured in the number of terms. 
As in other corpora, the number of small documents is much higher and we conjecture 
that the distribution is identical. The distribution naturally follows Zipf’s law [17, 11], 
as shown by the displayed trend-line. 

3.2 Term Statistics 

Number and Frequency of Distinct Terms. Table 2 gives the total number of terms, 
the number of distinct terms, and the average and median number of occurrences of each 
distinct term. In order to abstract from differences in capitalization, all characters were 
converted into lower case before computing these statistics. 

The document frequency for terms, i.e., the number of documents in which a certain
term appears (disregarding the number of occurrences in the document), is another
important statistic. Since a substantial part of the documents are written in foreign
languages, it is interesting to get some statistics for the terms occurring only in documents
written in Portuguese. We therefore computed the frequencies considering both the full
corpus and only the pages written in Portuguese.

Fig. 2. Document sizes in terms per document

Table 2. Number of terms

                                    All Pages     Pages in Port. only
Total number of terms               1652645998    1208036873
Number of distinct terms            7880609       4066300
Average number of occurrences       210           297
Median number of occurrences        2             2
Standard deviation (# of occur.)    3428          42247
Average document frequency          865           128
Median document frequency           1             1
Standard deviation (doc. freq.)     4496          5305

Table 3 lists the 25 most frequent terms occurring in the corpus. Frequency is mea- 
sured both in terms of the total number of term occurrences and document frequency, 
respectively. Most terms occurring in this list are candidate stop words in information 
retrieval systems for the Portuguese language. 
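
The two frequency measures used in Tables 2 and 3 can be sketched as follows. The function name and the input format (one list of already extracted terms per document, with at least one term overall) are assumptions made for the example; it is not the code used to process WPT-03.

```python
from collections import Counter
from statistics import mean, median


def term_statistics(documents):
    """Compute occurrence and document-frequency statistics.

    documents is an iterable of term lists, one list per document.
    """
    occurrences = Counter()         # total occurrences of each term
    document_frequency = Counter()  # number of documents containing the term
    for terms in documents:
        lowered = [term.lower() for term in terms]  # ignore capitalization
        occurrences.update(lowered)
        document_frequency.update(set(lowered))     # count once per document
    return {
        "total_terms": sum(occurrences.values()),
        "distinct_terms": len(occurrences),
        "average_occurrences": mean(occurrences.values()),
        "median_occurrences": median(occurrences.values()),
        "average_document_frequency": mean(document_frequency.values()),
        "median_document_frequency": median(document_frequency.values()),
        "most_frequent": occurrences.most_common(25),
    }
```

The average number of occurrences is simply the total number of terms divided by the number of distinct terms, which agrees with Table 2: 1652645998 / 7880609 is approximately 210.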

Term Size. We analyzed the average number of characters per term, regarding all terms
occurring in the corpus and regarding all distinct terms. Additionally, we give the
median and standard deviation. Once again the analysis is two-fold, with respect to all
documents in the corpus and restricted to documents written in Portuguese. Results are
given in Table 4.

Figure 3 shows the distribution of term size (regarding all terms of the corpus).
Approximately 99% of the terms are shorter than 15 characters, and a major part of
those longer than 15 characters are due to “garbage” in the corpus and the problem of
extracting the text from PDF files mentioned above.

Table 3. Most frequent terms

All docs                                    Portuguese docs
Term    Occ.        Term    Doc. Freq.      Term    Occ.        Term    Doc. Freq.
de      58734369    de      2727182         de      55977484    de      2344461
a       35651699    a       2600458         a       29617180    e       2093982
e       27818162    -       2502955         e       26472070    a       2018658
-       22314054    e       2400583         o       21162843    o       1854189
o       21994175    do      2056158         do      16919378    do      1825455
do      17674236    o       2034963         -       15435398    -       1805399
da      15196359    da      1890699         da      14745024    da      1733306
que     14659562    os      1865796         que     1435160     para    1632962
the     14187020    para    1747633         em      10302921    em      1606919
1       12251543    em      1707146         para    9468453     os      1589019
em      10523210    com     1642535         os      8114119     com     1442380
para    9742012     no      1572418         com     7678022     que     1291286
os      8692680     as      1477125         1       7532463     por     1260588
com     8345476     1       1446037         um      6755990     um      1256344
0       8114739     que     1371220         no      6412220     no      1243250
2       8026747     2       1363302         por     5630947     na      1174801
no      7140563     por     1349638         nao     5534784     as      1123699
um      6845245     um      1294310         as      5383637     dos     1072318
as      6800040     na      1240599         dos     5363523     uma     1053919
of      6427217     3       1183339         uma     5339622     nao     1040530
and     6085323     s       1152840         2       5080441     ao      1040371
to      5934084     dos     1132724         na      5041565     todos   1037091
por     5825647     pt      1106868         e       4630006     1       1007115
nao     5608615     uma     1102931         se      4351434     ou      950273
dos     5497639     todos   1090087         ou      4284627     2       944924



E-mail Addresses, Numbers and Hyphen Statistics. To improve the search engine’s 
handling of queries, it was interesting for us to analyze the frequency of things like 
e-mail addresses or numeric terms. 

Numeric terms, as the name suggests, consist solely of numeric characters. As for 
e-mail addresses, they are of the form X@X.X, where X stands for a non-empty alphanu- 
meric sequence plus the characters and (See the Internet RFC822 - Standard for 
the format of ARPA Internet text messages). Although each e-mail address counts as 
several terms in all other statistics (the separators are seen as punctuation), here they are 
seen as atomic units. 

Finally, hyphenated words are terms where one character is a hyphen, as defined in 
Section 3.1. Counting hyphenated terms is important as, depending on their frequency, 
it may be more interesting for the search engine to consider them as separated sequences 
of terms. 
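
The three kinds of special terms described above can be recognised with simple patterns. The following is only a minimal sketch, not the procedure actually used for WPT-03: the extra characters allowed in X (here `_` and `-`), the choice to accept several dot-separated parts after the `@`, and the function name `classify` are all assumptions made for illustration.

```python
import re

# Assumption: X is a non-empty run of ASCII letters and digits plus "_" and "-".
X = r"[A-Za-z0-9_-]+"
# X@X.X, allowing more than one dot-separated part after the "@" (assumption).
EMAIL_RE = re.compile(rf"{X}@{X}(?:\.{X})+")
# Numeric terms consist solely of numeric characters.
NUMERIC_RE = re.compile(r"[0-9]+")


def classify(term: str) -> str:
    """Return 'email', 'numeric', 'hyphenated' or 'other' for a single term."""
    if EMAIL_RE.fullmatch(term):
        return "email"
    if NUMERIC_RE.fullmatch(term):
        return "numeric"
    # A hyphenated term contains a hyphen and at least one other character.
    if "-" in term and len(term) > 1:
        return "hyphenated"
    return "other"
```

Checking e-mail addresses first means that an address containing a hyphen is counted once, as an atomic e-mail unit, rather than as a hyphenated term.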

Table 5 shows the total number of occurrences, the average number of occurrences 
for each distinct term in the collection, and the average size (in number of characters) 
of e-mail addresses, numeric terms and terms containing hyphens. The weighted 
average size also refers to the number of characters, but considering the number of all 
occurrences, instead of only the distinct ones. 

Table 4. Term size 

                     All terms                Distinct terms
                     all docs   only port.    all docs   only port.
average              4.8840     4.9680        8.9780     8.7400
standard deviation   4.4762     3.7930        39.6400    20.6860
median               4          4             7          8

Fig. 3. Term size in characters per term (x-axis: number of characters, logarithmic scale from 1E+00 to 1E+04) 

Morphology of the Terms. In order to get an idea of the morphology of the terms 
occurring in the corpus, we used the jspell [16] morphologic analyzer. This allowed us 
to relate base forms of words to inflected variants, and find out which syntactic categories 
the terms in the corpus belong to. 

After excluding all terms containing numeric characters, we obtained 1884932 dis- 
tinct terms (regarding only the Portuguese documents). 429937 of these can be analyzed 
morphologically using jspell. To reduce ambiguity, we only accept a solution if the 
lemma resulting from undoing inflection is contained either in the WPT-03 corpus itself, 
or in the CetemPublico corpus (see Section 3.3). 

Of the 429937 terms that can be analyzed, 179778 (41.81%) are unambiguously 
analyzed as both nouns and adjectives, 137270 (31.93%) as verbs, 13932 (3.24%) as ad- 
jectives and 10322 (2.40%) as nouns. Furthermore, 71321 (16.59%) terms are ambiguous 
between verb and noun/adjective, 7117 (1.66%) between just noun and noun/adjective 
and 2342 (0.54%) between adjective and noun/adjective. 7855 (1.83 %) terms are am- 
biguous in other respects. 

Table 5. Special terms 

                               E-mail addresses   Numeric terms   Terms with hyphens
Number of All Occ.             1264939            146136400       52499281
Number of Dif. Occ.            203638             570406          1510253
Average of occ.                6.21               256.20          34.76
Average term size              21.48              7.44            11.76
Average term size (weighted)   20.44              2.13            5.87



In the future, we plan on using other tools to enhance the morphology analysis in 
WPT-03, such as the PALAVROSO morphologic analyzer [9] or a good parts-of-speech 
tagger trained for the Portuguese language [1,2]. 

3.3 Inter-corpora Statistics 

This Section provides statistics comparing WPT-03 against CetemPublico [13, 12]. In 
the future, we plan to cross WPT-03 with other available corpora of Portuguese text, 
giving a more extended analysis. 

CetemPublico. To measure the coverage of the dictionary used for spelling correction in 
tumba!, we analyze the appearance of terms in the corpus that are contained in the spelling 
dictionary. As the dictionary contains all the terms that appear in the CetemPublico 
corpus, this statistic not only provides information about correctly spelled terms, but also 
about the overlap of the CetemPublico and the tumba! corpora. Note that the correctness 
of the terms cannot be fully guaranteed, as the CetemPublico corpus used to build the 
spelling dictionary itself contains errors. 

A substantial part of the terms differ only by the use of accents (i.e., replacing for 
example á by a). For that reason, in the statistics that compare these corpora, we provide 
in Tables 6 and 7 two result sets: one considering accented characters, and the other 
ignoring them. The meaning of each line in both tables is as follows: 

# WPT-03 terms in CP (distinct) indicates the number of distinct terms in the WPT-03 
corpus that also occur in the CetemPublico corpus; the percentage represents 
how many of the distinct terms of WPT-03 also appear inside the CetemPublico 
corpus. 

# CP terms in WPT-03 (distinct) indicates the number of distinct terms in CetemPublico 
that also occur inside the WPT-03 corpus; this is the same number as above, but 
the percentage is slightly different. 

# WPT-03 terms in CP (total) indicates the total number of terms in the WPT-03 corpus 
that also occur in CetemPublico; the percentage represents how many of all the 
terms in WPT-03 also occur inside the CetemPublico corpus. 

# CP terms in WPT-03 (total) indicates the total number of terms in the CetemPublico 
corpus that also occur in WPT-03; the percentage represents how many of all the 
terms in CetemPublico also occur inside the WPT-03 corpus. 

Note that whereas almost all of the terms in CetemPublico also occur in the WPT-03 
corpus, only 60% of the terms from WPT-03 appear in the CetemPublico corpus. 
This is, at least partly, due to the number of documents written in languages other than 
Portuguese, and also to CetemPublico being much “cleaner”, i.e., it contains fewer terms 
including numeric characters or “garbage” text. Previous studies have already indicated 
that while Web corpora have advantages in quantity (more “live” language information, 
more words and case-frames than newspaper corpora, etc.), they are usually noisier [14]. 

Table 6. Overlap with CetemPublico (counting all characters) 

                                  All docs             Only port.
# WPT-03 terms in CP (distinct)   153729 (1.95%)       152641 (3.75%)
# CP terms in WPT-03 (distinct)   153729 (3.46%)       152641 (3.43%)
# WPT-03 terms in CP (total)      4213578 (94.77%)     4212486 (94.74%)
# CP terms in WPT-03 (total)      984033934 (59.54%)   897616340 (74.30%)

Table 7. Overlap with CetemPublico (ignoring accentuated characters) 

                                  All docs              Only port.
# WPT-03 terms in CP (distinct)   150157 (1.91%)        148913 (3.66%)
# CP terms in WPT-03 (distinct)   150157 (38.89%)       148913 (38.57%)
# WPT-03 terms in CP (total)      4221291 (94.94%)      4219958 (94.91%)
# CP terms in WPT-03 (total)      1003575179 (60.73%)   907834378 (54.93%)

Postal Codes. Having an idea of the number of geographic entities that are present in the 
WPT-03 corpus would be very interesting for us in the context of project tumba!. We used 
a list of Portuguese postal codes to find out which and how many “geographic” names 
appear in the text. The list is provided by CTT (Portuguese Post Office) and can be 
downloaded from http://codigopostal.ctt.pt/pdcp-files/todos_cp.zip. 
It contains not only postal codes, but also city, street and district names (277980 names 
of geographic entities overall). 



Table 8. Postal Codes found in tumba! and in CTT list 

          All Postal Codes   Distinct Postal Codes
WPT-03    683458             33799
CTT       236924             170549



Table 9. Statistics of Postal Code occurrences 

                                        All Docs
CTT Postal codes in WPT-03 (distinct)   27695 (81.94%)
WPT-03 Postal Codes in CTT (distinct)   27695 (16.24%)
CTT Postal codes in WPT-03 (total)      567326 (83.01%)
WPT-03 Postal Codes in CTT (total)      60885 (25.70%)
Average of occ.                         2.20

In the analysis, we considered all terms of the form XXXX-XXX as postal codes, 
with X being a numeric character. Tables 8 and 9 show the statistics for postal code 
occurrences. Nearly 1/6 of all Portuguese postal codes appear in WPT-03. We can 
speculate that these are the postal codes for areas where many business and commercial 
entities are located. In principle, every postal code found in WPT-03 should also occur in 
the CTT database, but 17-19% of the postal codes in WPT-03 are in fact 
invalid. 
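
The postal code analysis can be sketched as follows. This is only an illustration under stated assumptions: the function name `postal_code_stats`, the input format (a flat list of terms and a set of valid CTT codes) and the returned keys are hypothetical and not taken from the tumba! code base.

```python
import re
from collections import Counter

# A postal code is any term of the form XXXX-XXX, with X a numeric character.
POSTAL_RE = re.compile(r"[0-9]{4}-[0-9]{3}")


def postal_code_stats(terms, ctt_codes):
    """Count postal-code-like terms and how many of them occur in the CTT list."""
    found = Counter(t for t in terms if POSTAL_RE.fullmatch(t))
    valid = {code for code in found if code in ctt_codes}
    return {
        "total": sum(found.values()),          # all occurrences in the corpus
        "distinct": len(found),                # distinct codes in the corpus
        "distinct_valid": len(valid),          # distinct codes also in CTT
        "total_valid": sum(found[c] for c in valid),
    }
```

For example, `postal_code_stats(["1000-001", "1000-001", "9999-999"], {"1000-001"})` reports three occurrences of two distinct codes, of which one distinct code (two occurrences) is valid.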

Geographic Entities. To have an idea of the richness of WPT-03 in geographical 
references, we searched the corpus for such information. For this purpose, we did a 
case-insensitive search on the 308 Portuguese municipalities. 

As many Portuguese geographic names consist of more than one word, we need to 
group individual terms, in order to provide statistics on the geographic entities identified 
in the corpus. To locate these entities, we use a simple algorithm that looks at all matches 
of those geographic names in the “word-grams” from WPT-03. 

The total number of geographic entities identified in the corpus using this method is 
8147120. The most frequent are given in Table 10, along with the overall number of 
occurrences. 
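
One way to realise the matching over word-grams is sketched below. It is a minimal illustration, assuming the corpus is available as a token list and the municipality names as plain strings; the function name `count_geo_names` is hypothetical, and the sketch counts every match, including overlapping ones.

```python
from collections import Counter


def count_geo_names(tokens, names):
    """Case-insensitive count of (possibly multi-word) names in a token list."""
    # Each name becomes a tuple of lower-cased words, e.g. ("vila", "real").
    name_set = {tuple(name.lower().split()) for name in names}
    longest = max(len(name) for name in name_set)
    tokens = [t.lower() for t in tokens]
    counts = Counter()
    for start in range(len(tokens)):
        # Try every word-gram of length 1..longest starting at this position.
        for n in range(1, longest + 1):
            gram = tuple(tokens[start:start + n])
            if len(gram) == n and gram in name_set:
                counts[" ".join(gram)] += 1
    return counts
```

Because the search is purely lexical, a surname such as "Almeida" is counted as a geographic reference, which is exactly the limitation discussed after Table 10.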



Table 10. Most frequent geographic names 

Geographic Name   Number of Occurrences
lisboa            1034268
porto             651108
coimbra           307881
guarda            198436
aveiro            192804
braga             186410
almeida           142591
leiria            121280
faro              111028



This was only a crude approach to measure the amount of geographic references, and 
the results are not conclusive. For instance many Portuguese proper names (especially 
people’s names) are also geographic names, and they were identified in this study as 
geographic references. In the future, we plan on conducting a much more accurate 
analysis of the occurrence of this information in WPT-03, using specific software for 
accurate named entity recognition. 

4 Conclusions 

We used the tumba! repository to construct a textual corpus from the pages of the “Por- 
tuguese Web”, denominated WPT-03. The corpus was then analyzed using common 
statistical techniques from corpus linguistics [10]. 

This study was motivated by our interest in finding out more about the textual contents 
of the tumba! repository, including both information about the documents (their size, 





A Statistical Study of the WPT-03 Corpus 393 



language, etc.) and the terms contained in the documents. With this data we can bet- 
ter model the capacity and the algorithms of the tumba! search engine, and in tandem 
provide insights to a large corpus in natural language that are of interest to other re- 
searchers. 

Especially interesting is the comparison of WPT-03 with several other corpora made available 
through Linguateca. WPT-03 contains the more or less colloquial language found on Web 
pages, whereas the Linguateca corpora mostly contain the more formal language found in 
newspaper articles. 

It will also be interesting to repeat this study regularly and track the evolution of the 
data - most probably the most frequent terms (apart from function words, of course) 
will change over time. The study could also be carried out using different sub-corpora, 
in order to find differences and similarities for different Web “communities”. 

Finally, a complementary study on the logs for the queries submitted to tumba! would 
also be very useful, in order to understand the way Portuguese users search for infor- 
mation on the Web, and if the information they are looking for is widely available or 
not. 

Abstract. This paper describes two procedures for generating very short summaries 
for documents from the DUC-2003 competition: a chunk extraction method 
based on syntactic dependences, and a simple keyword-based extraction. We explore 
different techniques for extracting and weighting chunks from the texts, and 
we draw conclusions on the evaluation metric used and the kind of features that 
are more useful. Two preliminary versions of this procedure ranked in the 12th 
and 13th positions with respect to unigram recall (ROUGE-1) at DUC-2004 (out 
of 39 runs submitted). 



1 Introduction 

Headline generation is the problem of generating a very short summary of a document, 
which condenses the main ideas discussed in it. This paper describes two different 
procedures tested on the collection provided for the 2003 Document Understanding 
Conference (DUC-2003), and a hybrid approach that combines them. In all the experiments 
described here, we can identify two separate steps: firstly, the identification and 
extraction of the most important sentences from the document; secondly, the extraction 
of relevant keywords and phrases from those sentences. The purpose of the first step is 
to restrict as much as possible the search space for the second step, thereby simplifying 
the selection of fragments for the headline. We have evaluated the procedures using the 
ROUGE-1 score [1], and we also explore some of the characteristics of this metric. The 
words summary and headline will be used interchangeably throughout this paper. 

A popular approach for generating headlines consists in first identifying the most rel- 
evant sentences, and then applying a compaction procedure. Sentence-extraction proce- 
dures are already well-studied [2], so we shall focus on the differences in the compaction 
step. Some of the techniques are (a) deletion of all subordinate clauses; (b) deletion of 
stopwords (determiners, auxiliary verbs, etc.) [3-5]; (c) extracting fragments from the 
sentences using syntactic information, e.g. the verb and some kind of arguments, such as 
subject, objects or negative particles [6-9]; and (d) using pre-defined templates [10]. A 



* This work has been sponsored by CICYT, project number TIC2001-0685-C02-01. 



J. L. Vicedo et al. (Eds.): EsTAL 2004, LNAI 3230, pp. 395-406, 2004. 
© Springer-Verlag Berlin Heidelberg 2004 




396 



E. Alfonseca et al. 



different approach consists in extracting, from the document, a list of topical keywords, 
collocations, noun phrases [3, 11-13]. Using this procedure, the resulting headline will 
not be grammatical, but it may provide a useful description of the topic of the article. 

In Section 2 we describe the experimental settings for evaluating the system, and 
Section 3 briefly summarises the general architecture of the system. Next, Section 4 
describes the procedures for sentence selection, and Sections 5, 6 and 7 describe all the 
experiments performed for generating headlines. Finally, Section 8 describes the conclu- 
sions we can draw from the results obtained, and discusses possible lines for future work. 



2 Experimental Settings 

The purpose of the work is to generate very short headlines from documents. We can 
describe this task using Mani’s classification of automatic summarisation systems [2], 
which takes into account the following characteristics: compression rate is typically 
very high (a few words or bytes); the audience is generic, as the headlines do not depend 
on the user; the function is indicative, as it must suggest the contents of the original 
document without giving away details; and they should be coherent, and generated from 
single documents. In the experiments, the genre used is newswire articles, written in a 
single language (English). 

All the experiments have been tested on the data provided for task 1 in DUC-2003. 
It is a set of 624 documents, grouped in sixty collections about specific topics, such as 
schizophrenia or flooding in the Yangtze river. For each of the documents, NIST has 
provided four hand-written summaries to be used as gold standard. Throughout this 
work, we use a 75-byte limit, but we apply it in a lenient way: if a word or a chunk 
selected for a summary exceeds the limit, it will not be truncated. 

2.1 ROUGE as Evaluation Metric 

ROUGE [1, 14] is a method to automatically rank summaries by comparing them to other 
summaries written by humans. The original ROUGE-N metric is basically 
an n-gram recall metric, which calculates the percentage of n-grams from the reference 
summaries that appear in the candidate summary: 

\[
\mathrm{ROUGE\text{-}N} =
\frac{\sum_{S \in \mathit{Refs}} \left|\{\mathit{gram}_n \in S : \mathit{gram}_n \in \mathit{Cand}\}\right|}
     {\sum_{S \in \mathit{Refs}} \left|\{\mathit{gram}_n \in S\}\right|}
\]

Note that if an n-gram appears in several references at the same time, it is counted 
as many times, which makes sense because an n-gram for which there is consensus 
between the humans should receive a higher weight. The procedure has been extended 
with additional calculations in order to improve its accuracy [14]. 
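
For N = 1 the metric reduces to unigram recall over the pooled references. The sketch below is a minimal illustration of that definition, not the official ROUGE implementation (which includes the additional calculations mentioned above); the function name `rouge_1` and the list-of-tokens input format are assumptions. The example data encode the four models shown in Table 1.

```python
from collections import Counter


def rouge_1(candidate, references):
    """Unigram recall of a candidate against several reference summaries."""
    cand = set(candidate)
    matched = total = 0
    for ref in references:
        counts = Counter(ref)
        total += sum(counts.values())
        # Every reference occurrence of a word present in the candidate counts,
        # so a word used by several references weighs more.
        matched += sum(c for word, c in counts.items() if word in cand)
    return matched / total if total else 0.0


# The four models of Table 1 (w1 appears twice in the fourth model).
models = [
    ["w1", "w3", "w4"],
    ["w3", "w4", "w5", "w6"],
    ["w3", "w4"],
    ["w1", "w1", "w2", "w4", "w6"],
]
score = rouge_1(["w1", "w3", "w4", "w6"], models)  # Candidate 1 of Table 1
```

With these inputs 12 of the 14 reference unigrams are covered, which matches the score of 0.8571 listed for Candidate 1 in Table 1.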

Lin and Hovy’s experiments [1] indicate that ROUGE-1 correlates well with human 
evaluations of automatic headlines. In fact, given the availability of four hand-written 
summaries for each document, ROUGE has been used for evaluating the summaries 
produced by the participant teams in DUC-2004. Therefore, we have chosen to evaluate 
our system with the ROUGE-1 metric. 

Table 1. Example of the procedure for finding the combination of words for which the unigram 
recall is maximised 

Document 1
Words       w1   w2   w3   w4   w5   w6
Model 1     1    0    1    1    0    0
Model 2     0    0    1    1    1    1
Model 3     0    0    1    1    0    0
Model 4     2    1    0    1    0    1
Frequency   3    1    3    4    1    2

Genotype
Words         w1   w2   w3   w4   w5   w6   Score
Candidate 1   1    0    1    1    0    1    0.8571
Candidate 2   1    0    1    0    0    1    0.5714
Candidate 3   1    0    0    0    0    1    0.3571
Candidate 4   0    0    0    0    1    0    0.0714
Candidate 5   1    1    1    1    1    1    0



2.2 Upper Bound of ROUGE-1 Score in 75-Bytes Summaries 

Before using ROUGE-1 to evaluate the summaries, it would be interesting to discover 
the range of scores that a system can obtain in this particular task. ROUGE-1 
has been used to rank existing summarisation systems in the DUC-2004 competition. 
Although we know the scores obtained by human summarisers, between 0.25017 and 
0.31478 in DUC-2004 for 75-byte summaries, we are not aware of any report of 
the highest score that can possibly be obtained with this metric. 

In a first experiment, we study the range of values that can be obtained using 
ROUGE-1 when comparing a candidate summary to four manual headlines. 

We shall use, as in DUC, four reference summaries for each document. When 
evaluating a candidate headline, ROUGE-1 can be considered as the unigram recall of 
the candidate. If the candidate and all the references have the same length, it is obvious 
that, unless all the references have the same words, a candidate summary will never 
contain every word from every reference (which would mean a recall of 1). 

The experiment for discovering the highest possible score has been designed in the 
following way: 

A. For each document, 

   1. Take its four hand-written headlines. 
   2. Collect all the words that appear in them, excluding closed-class words. 
   3. Count the frequency of each word. 
   4. Look for the combination of words that maximises the ROUGE-1 score and 
      has less than 75 bytes altogether. 

B. Calculate the average of this score for all the documents. 

Step 4 is the most costly step. A good approximation can be obtained by choosing the 
words with the highest frequencies in the model summaries. Still, that does not guarantee 
that the obtained summary will be the best one, as it may be better to substitute a long word 
with a large frequency for two short words which altogether have a higher frequency. A 
brute force approach would require too much computational time, and therefore we opted 
for a genetic algorithm to find the combination of words that maximises the unigram 
recall. Table 1 illustrates how the search is performed: 

- The upper part of the table represents the reference headlines (the models) for a 
couple of documents, and the frequency of each word in each model. For instance, 
the fourth model contains w1 twice, and w2, w4 and w6 once. 

find babies may be more schizophrenia possibly brain development dopamine 
cesarean babies be more schizophrenia possibly brain development dopamine 

1. Researchers find Cesarean babies may be more susceptible to schizophrenia. 
2. Natural childbirth possibly instrumental in brain development. Cesareans associated with schizophrenia 
3. Canadian rat research links caesarean birth with schizophrenic dopamine reactions. 
4. Cesarean, babies, susceptible, schizophrenia, Boksa, El-Khodor, dopamine, amphetamines, brain, development 

Fig. 1. Two of the best scoring summaries for a document from collection d100 (DUC-2003 data), 
and the four gold-standard headlines 



- The next line, labelled Frequency, contains the sum of frequencies of each word in 
all the models, for each of the documents. 

- We encode a candidate summary, shown below in the table, as a boolean vector of 
a length equal to the total number of words in all the models. 

- The fitness function for the genetic algorithm is 0 if the candidate summary has more than 75 
bytes (e.g. Candidate 5), and the ROUGE-1 metric otherwise (all the others). 

- The genetic algorithm evolves a population of boolean vectors, using the muta- 
tion and crossover operators, until for a large number of generations there is no 
improvement in the fitness of the population. 
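
The search described in the list above can be sketched as a small genetic algorithm over boolean vectors. This is only an illustration under stated assumptions, not the authors' implementation: the population size, the selection scheme (keeping the better half), one-point crossover, single-bit mutation, the stopping patience, the one-byte separator between words and the word sizes in the example are all hypothetical choices.

```python
import random


def fitness(genes, freqs, sizes, limit=75):
    """ROUGE-1 of the selected words, or 0 if they exceed the byte limit."""
    chosen = [i for i, g in enumerate(genes) if g]
    # Assumption: words are joined by a single one-byte separator.
    length = sum(sizes[i] for i in chosen) + max(len(chosen) - 1, 0)
    if length > limit:
        return 0.0
    return sum(freqs[i] for i in chosen) / sum(freqs)


def evolve(freqs, sizes, limit=75, pop_size=30, patience=50, seed=0):
    """Evolve boolean vectors until `patience` generations bring no improvement."""
    rng = random.Random(seed)
    n = len(freqs)

    def score(genes):
        return fitness(genes, freqs, sizes, limit)

    pop = [[rng.random() < 0.5 for _ in range(n)] for _ in range(pop_size)]
    best, stale = max(pop, key=score), 0
    while stale < patience:
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]           # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)            # one-point crossover
            child = a[:cut] + b[cut:]
            flip = rng.randrange(n)              # single-bit mutation
            child[flip] = not child[flip]
            children.append(child)
        pop = parents + children
        top = max(pop, key=score)
        if score(top) > score(best):
            best, stale = top, 0
        else:
            stale += 1
    return best, score(best)


# Word frequencies from Table 1; the byte sizes are made up for illustration.
freqs = [3, 1, 3, 4, 1, 2]
sizes = [8, 9, 7, 6, 10, 5]
best, best_score = evolve(freqs, sizes, limit=30)
```

The fitness function reproduces the scores in Table 1 when the limit is not exceeded, for example 12/14 for Candidate 1, and returns 0 for any genotype that is too long.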

The summaries obtained with this procedure are simply a list of keywords. Figure 1 
shows a couple of keyword choices that produce the best ROUGE-1 scores for a document 
in collection d100, and the four gold-standards used. 

The best choice of keywords for the 624 documents in the data set has produced a 
mean ROUGE-1 score of 0.48735. Therefore, we may take it as the upper bound that 
can be obtained using this evaluation procedure with this data collection. This result 
is consistent with the evaluation done in DUC-2004, where the test set is very similar: 
newswire articles and four manual summaries for each one. In this case, all the 
human-made models have received a ROUGE-1 score between 0.25017 and 0.31478, which 
represents between roughly 51% and 65% of the upper bound. Constraints such as 
grammaticality and the fact that the same idea can be expressed in many ways probably 
make it difficult to reach a higher score. 

3 Our Approach 

Our system has been divided into the following main steps: 

1. Firstly, we select a few sentences from the document, so that there is much less 
information from which to extract the headline. 

2. Secondly, from those sentences, we extract and rank either keywords or phrases. 

3. The headline is finally built by concatenating the keywords or chunks extracted in 
the previous step, until we reach the length limit. As said before, if the last keyword 
exceeds the limit, we do not truncate the summaries. 
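
The third step, with its lenient length limit, can be sketched as follows. This is a minimal illustration; the function name `build_headline`, the use of a single space as separator and the UTF-8 byte count are assumptions.

```python
def build_headline(ranked_chunks, limit=75):
    """Concatenate ranked chunks until the byte limit is reached.

    The limit is lenient: the chunk that crosses it is kept whole,
    never truncated.
    """
    parts, length = [], 0
    for chunk in ranked_chunks:
        if length >= limit:
            break
        parts.append(chunk)
        # One separator byte is added between consecutive chunks.
        length += len(chunk.encode("utf-8")) + (1 if len(parts) > 1 else 0)
    return " ".join(parts)
```

A stricter variant would drop the last chunk instead of keeping it; the lenient version follows the convention stated in Section 2.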



The following three sections further elaborate these steps. 

Fig. 2. ROUGE-1 results for a different number of sentences selected 



4 Selection of Sentences 

The first step of our system is the selection of the most relevant sentences. This is done 
to reduce the search space for finding the best chunks of text with which to construct 
the headline. The sentence-extraction procedure we use is based on the Edmundsonian 
paradigm: each sentence is scored with a linear combination of the values of several 
features. The features used are, among others, the sentence length, the position of the 
sentence in the paragraph, the position of the paragraph in the document, and the word 
overlap between the sentences selected for the summary. Although some related work 
indicates that simply choosing the first sentences from the document can be equally 
useful for headline extraction [9, 15], we have opted to continue using this paradigm. 
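
A linear combination of feature values can be sketched as below. This is only an illustration: the feature names, values and weights are hypothetical, the function names are not from the paper, and the sketch ignores features such as word overlap that depend on the sentences already selected.

```python
def score_sentence(features, weights):
    """Weighted sum of the feature values of one sentence."""
    return sum(weights[name] * value for name, value in features.items())


def select_sentences(sentences, weights, k=3):
    """Return the texts of the k highest-scoring sentences."""
    ranked = sorted(
        sentences,
        key=lambda s: score_sentence(s["features"], weights),
        reverse=True,
    )
    return [s["text"] for s in ranked[:k]]


# Hypothetical weights and feature values, for illustration only.
weights = {"length": 0.5, "position": 2.0}
sentences = [
    {"text": "A", "features": {"length": 1.0, "position": 0.0}},
    {"text": "B", "features": {"length": 0.2, "position": 1.0}},
    {"text": "C", "features": {"length": 0.0, "position": 0.5}},
]
top_two = select_sentences(sentences, weights, k=2)
```

In this toy example the position feature dominates, so sentences B and C are selected ahead of the longer sentence A.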

In previous work we described a procedure for acquiring the weights for the linear 
combination [7, 16]. It uses genetic algorithms, in a way which is very similar to the 
procedure used in the previous section: for each possible set of weights, we calculate 
the summary produced from those weights and evaluate it against the model headlines. 
The unigram recall of the summary is the fitness function of the set of weights. Finally, 
we keep the weights which select the summaries that maximise the unigram recall. 
Genetic algorithms had previously been used for summarisation by Jaoua and Ben 
Hamadou [17]. 

The hypothesis that is the basis for every sentence-extraction procedure is that there 
are a few sentences which hold the most relevant information, and a large number of 
sentences which elaborate those main ideas. Figure 2 shows the ROUGE-1 score of a 
summary as a function of the number of sentences selected. As can be seen, with just the 
top three sentences, the ROUGE-1 score reaches around 0.60, using the same reference 
headlines as in the previous experiment. From that point onward, the curve 
flattens. The maximum score attained is around 0.87, when the complete documents 
are selected. 

This indicates that just a few sentences have enough information for generating the 
summaries. The next step will be to reduce those sentences to no more than 75 bytes, 
trying to keep the ROUGE-1 score as near 0.35 as possible. 

Table 2. Sentences selected from document APW19981106.0542, and verb phrases obtained from 
them 

Sentences 

Portugal and Indonesia are mulling a new proposal to settle their dispute over East Timor, and talks between the 
two sides are at a crucial stage, according to a U.N. envoy. 

Envoy Jamsheed Marker said late Thursday that the U.N. proposal envisages broad autonomy for the disputed 
Southeast Asian territory. 

“We’ve reached a very important, and I might even say critical, moment in the talks,” Marker told reporters after 
a dinner with President Jorge Sampaio. 

Verb phrases                                              Filter
[Portugal and Indonesia] are mulling [a new proposal]     none
to settle [their dispute over East Timor]                 none
are                                                       1
[Envoy Jamsheed Marker] said [autonomy]                   2
[We] ’ve reached [a very important]                       3
[I] might even say                                        2
[Marker] told [reporters]                                 2



5 Chunk-Based Headline Extraction 

Most newswire articles describe events, which are usually (but not always) expressed 
with verbs. Therefore, we thought that a good idea for generating the headline was to 
select the most relevant verbs from the selected sentences, together with their arguments. 
The process is divided in three steps: verb extraction, verb-phrase ranking and headline 
generation. 

To this aim, we process all the documents using a syntax analyser. The parser used is 
the wraetlic tools [18]*. These include a Java PoS tagger based on TnT [19], a stemmer 
based on the LaSIE morphological analyser [20], three chunkers written in C++ and 
Java (with ~94.5% accuracy when evaluated on the WSJ corpus), and a subject-verb 
and verb-object detector, written in Java ad hoc with hand-crafted rules. Multiword 
expressions have also been identified automatically with the following procedure: (a) 
eliminate stop-words and verbs from the text; (b) collect bigrams and trigrams whose 
frequency is above a threshold (e.g. three times); (c) put the stopwords back where necessary 
(e.g. in “President of the United States”). All the experiments reported in this section 
were evaluated on the whole DUC-2003 corpus for Task 1 (headline generation). 

5.1 Verb-Phrase Extraction 

Verbs are extracted in the following way: using our PoS tagger, we first obtain all the verbs 
from the document. With the partial parser, we mark up each verb with its subject and 
arguments. Table 2 shows the sentences obtained from a document, and the verb phrases 
extracted from them. Note that the parser is not perfect: sometimes it cannot identify the 
arguments of a verb. Some errors are due to poor grammar coverage, and others are due 
to mistakes in the PoS tagging. However, in many cases the arguments found are correct. 

Filtering Heuristics. A manual revision of the verb phrases showed that many of them 
contained information that most probably was not relevant enough to be included in 
the headline. Some of these cases include the following: 



* Available at www.ii.uam.es/~ealfon/eng/download.html 

Table 3. ROUGE-1 score obtained (a) by selecting sentences from the document; (b) by extracting 
the verb phrases (with their arguments) from them; and (c) after filtering 

                   All sentences           All VPs                 Kept VPs
No. of sentences   Mean length  ROUGE-1    Mean length  ROUGE-1    Mean length  ROUGE-1
1                  198          0.44155    113          0.24635    73           0.20851
2                  364          0.53025    194          0.30725    102          0.23756
3                  527          0.59137    277          0.35974    129          0.26616
4                  723          0.63480    333          0.38579    145          0.28082



1. If the parser has not been able to identify any argument of the verb phrase. 

2. If the verb, in WordNet, is a hyponym of communicate, utter, or judge, because in 
many cases the information that was communicated is more relevant than the name 
of the person who stated it. 

3. If the subject is in first or second person. 

4. If the verb is either in passive form or a past participle, and the parser found neither its 
subject nor its agent. This is because in most of these cases the verb is functioning 
as an adjective and it was wrongly tagged by the PoS tagger. 

The right column of Table 2 shows, for each verb, either the number of the filter 
because of which it was ruled out, or none, in the case that it passed all filters. As can 
be seen, only two verb phrases from this document have passed the set of filters. 
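
The four filters can be sketched as a function that returns the number of the first filter a verb phrase fails. This is only an illustration: the `VerbPhrase` structure is hypothetical, the small `COMMUNICATION_VERBS` set is a stand-in for the WordNet hyponym test, and treating the subject as one of the arguments in filter 1 is an assumption chosen so that the outcome agrees with Table 2.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in for "hyponym of communicate, utter or judge in WordNet" (assumption).
COMMUNICATION_VERBS = {"say", "tell", "state", "announce", "judge"}
FIRST_SECOND_PERSON = {"i", "we", "you"}


@dataclass
class VerbPhrase:
    lemma: str
    subject: Optional[str] = None
    args: tuple = ()
    passive_or_participle: bool = False
    agent: Optional[str] = None


def first_failed_filter(vp):
    """Return 1-4 for the first filter that rules the phrase out, else None."""
    if vp.subject is None and not vp.args:
        return 1  # the parser identified no argument at all
    if vp.lemma in COMMUNICATION_VERBS:
        return 2  # verb of communication or judgement
    if vp.subject is not None and vp.subject.lower() in FIRST_SECOND_PERSON:
        return 3  # subject in first or second person
    if vp.passive_or_participle and vp.subject is None and vp.agent is None:
        return 4  # probably an adjective mis-tagged as a verb
    return None


# The verb phrases of Table 2, encoded by hand.
table2 = [
    VerbPhrase("mull", "Portugal and Indonesia", ("a new proposal",)),
    VerbPhrase("settle", None, ("their dispute over East Timor",)),
    VerbPhrase("be"),
    VerbPhrase("say", "Envoy Jamsheed Marker", ("autonomy",)),
    VerbPhrase("reach", "We", ("a very important",)),
    VerbPhrase("say", "I"),
    VerbPhrase("tell", "Marker", ("reporters",)),
]
results = [first_failed_filter(vp) for vp in table2]
```

Applying the filters in this order yields the same filter numbers as the right column of Table 2, with `None` standing for "none".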

Information Kept in the VPs. By extracting the verbs, the sentences selected in the
previous step are reduced to a small set of verbs and their arguments. After applying the
filters, this set is reduced even further, as in the example above. In Figure 2 we studied
the amount of information, expressed with the ROUGE-1 score, that we still kept by
selecting just two or three sentences from a document. We can now repeat the experiment
to see how much the ROUGE-1 score decreases if we substitute the selected
sentences with the list of verb phrases, and if we substitute this list with just the verb
phrases that have passed all the filters.

The results are shown in Table 3. The first column shows the number of sentences that
have been selected from the original document. The second column shows the ROUGE-1
score if the summary contains the complete sentences selected. These are the same
values used for plotting Figure 2. The third column shows the ROUGE-1 score if we
score not the complete sentences, but the list of verbs and their arguments, as printed
in Table 2. Finally, the fourth column shows the score if we list the verb phrases after
applying all the filters. It can be seen that the score decreases in the last two columns.
However, the decrease is not proportional to the compaction level performed in each of
these steps. This suggests that we are removing mostly words that do not appear
in the reference summaries.

5.2 Verb-Phrase Ranking 

We have now extracted a list of verbs from the selected sentences. In order to generate
the headline, we would like to rank them according to some metric of relevance. Lin and
Hovy [21] describe how topic signatures can be built automatically from collections, and
show an application to text summarisation. We have followed the same approach, but
we have calculated the topic signatures both for collections and for single documents.
The procedure is the following:

1. Collect all the words and their frequencies for each document or collection.

2. Collect all the words from the rest of the documents or collections. Consider this as 
the contrast set. 

3. Calculate the weight of each of the words using a weight function. 

Several parameters of this procedure can be varied. First, there are many weight
functions that we can apply: we have tried the likelihood ratio [22], which was
the one used by Lin and Hovy [21], and the tf.idf, Mutual Information and t-score
metrics. Second, as indicated above, the signatures may be calculated either for
each document separately, or for each collection.

With this procedure, for all the words in each document or collection, we can calculate 
their weight, which is a measure of the relevance of that word in the scope of the document 
or collection. The verb phrases that we had extracted and filtered can be weighted using 
the values from the topic signatures: each verb phrase may receive as weight the sum of 
the weights of all the words in the verb and its arguments. 
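
The three steps above and the verb-phrase weighting can be sketched as follows. This is a minimal illustration, assuming a simple smoothed tf.idf as the weight function (the exact formulas used in the experiments are not given here) and assuming that documents are already tokenised; the function names are hypothetical.

```python
import math
from collections import Counter


def topic_signature(target, contrast_docs):
    """Weight every word of `target` (a list of tokens) against the contrast set.

    Assumption: weight = term frequency * log(N / document frequency), where the
    target itself counts as one of the N units, so weights are never negative.
    """
    term_freq = Counter(target)                      # step 1: words and frequencies
    n_units = 1 + len(contrast_docs)                 # step 2: target + contrast set
    weights = {}
    for word, freq in term_freq.items():             # step 3: apply the weight function
        doc_freq = 1 + sum(1 for doc in contrast_docs if word in doc)
        weights[word] = freq * math.log(n_units / doc_freq)
    return weights


def weight_verb_phrase(vp_tokens, signature):
    """Sum the signature weights of all the words in the verb and its arguments."""
    return sum(signature.get(token, 0.0) for token in vp_tokens)


signature = topic_signature(
    ["timor", "timor", "proposal", "the"],
    [["the", "election"], ["the", "market", "proposal"]],
)
print(weight_verb_phrase(["timor", "proposal"], signature))
```

Words that occur in every unit (such as "the" in this toy input) receive weight zero, so a verb phrase is ranked by its topical words only.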

5.3 Headline Generation and Results 

To generate the headline, while the summary is shorter than 75 bytes, we add the
remaining verb phrase with the highest score. To preserve grammaticality, we do not truncate
the summaries if they exceed the limit by a few bytes. In the example above, only two
verb phrases remain after the filtering:

[Portugal and Indonesia] are mulling [a new proposal] 
to settle [their dispute over East Timor] 

These will be weighted and ranked next using a topic signature. The headline will
be generated in the following way:

1. First, the system selects the highest-weighted verb phrases until their total length
exceeds 75 bytes. Note that, if the limit is exceeded by a few bytes, we do not
truncate the summaries, so as to keep them grammatical. In this example, both VPs
will be selected.

2. Second, they will be put together in the order in which they appeared in the original
document. If there was any conjunction linking them, it will be added to the summary
so as to improve the readability.

The resulting summary in the example will be: 

Portugal and Indonesia are mulling a new proposal to settle their dispute over 
East Timor. 
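
The selection and reordering steps can be sketched as below. This is only an illustration under stated assumptions: the `VerbPhrase` record, the function names and the weights in the example are hypothetical, phrases are joined with single spaces, and the re-insertion of conjunctions is omitted.

```python
from collections import namedtuple

# position: index of the verb phrase in the original document;
# weight: sum of topic-signature weights (the values below are made up).
VerbPhrase = namedtuple("VerbPhrase", "position text weight")


def byte_length(phrases):
    """Length in bytes of the phrases joined by single spaces."""
    return len(" ".join(p.text for p in phrases).encode("utf-8"))


def build_headline(verb_phrases, limit=75):
    selected = []
    for phrase in sorted(verb_phrases, key=lambda p: p.weight, reverse=True):
        if byte_length(selected) >= limit:
            break  # limit reached; phrases already chosen are never truncated
        selected.append(phrase)
    selected.sort(key=lambda p: p.position)  # restore document order
    return " ".join(p.text for p in selected)


phrases = [
    VerbPhrase(0, "Portugal and Indonesia are mulling a new proposal", 2.1),
    VerbPhrase(1, "to settle their dispute over East Timor", 3.4),
]
print(build_headline(phrases))
```

Because the length check happens before a phrase is added, the last phrase may push the headline a few bytes over the limit, which matches the decision not to truncate.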

We have evaluated the ten different configurations (choice of weight function, and
topic signatures for either documents or collections) using the ROUGE-1 score. We
saw, at the beginning, that by choosing just three sentences from the original document
we could reach a very high ROUGE-1 score, and that the slope of the curve in Figure 2
flattens from that point onward; this experiment was therefore done using three sentences
from each document.

Table 4. (a) Effect of using a different weight function for calculating the topic signatures in the
final ROUGE-1 score. (b) Effect of selecting a different number of sentences for each document
or for each collection in the final ROUGE-1 score

(a)
Function              Docs       Cols.
Likelihood ratio      0.18298    0.19726
tf.idf                0.18910    0.19231
                      0.19753    0.20105
Mutual Information    0.18458    0.17933
t-score               0.18599    0.19011

(b)
No. of sentences      ROUGE-1 score
1                     0.19363
2                     0.19956
3                     0.20105
4                     0.19972


Table 4(a) shows the results. The best score (0.20105) is the one obtained with the function
in the third row, calculated for collections, while the likelihood ratio for collections attains
the second-best score (the difference is not statistically significant). Apart from tf.idf, the
other weight functions have lower results, statistically significant at 0.95 confidence.

It can be observed here that the likelihood ratio needs a larger set of examples, because
it is the one that scored worst when the signatures are obtained for single documents, but
reaches second place if we consider collections. Furthermore, when we calculate
the signatures for each collection the results are slightly better than when we select the
words that are more representative just for each document.

In order to check whether the choice of selecting three sentences at the beginning of 
the process was correct, we tried yet another experiment. We may think that the more 
sentences selected, the more verb phrases we have for generating the headline. On the 
other hand, if we have too many verb phrases, it will be more difficult to identify those 
that are more informative. Table 4(b) shows the results obtained by selecting a different 
number of sentences from each document in the sentence extraction step. This result also 
suggests that three sentences is a good choice, although the difference is not statistically 
significant. 

Finally, Figure 3 shows the headlines obtained for the documents in the collection 
about East Timor. It can be seen that most of them are grammatical and easy to read, 
although the context of the news does not appear in the headlines, so it is not possible 
to know for most of them that the events occurred in East Timor. 



Indonesia's National Commission will investigate accusations.

the documents show a total, of Indonesian troops assigned the number.

to extradite former President Suharto; Suharto be extradited.

Stray bullets killed two villagers and police.

Habibie put an end.

Rebels were holding two soldiers; Three soldiers and one activist were killed.

Jakarta does not let six East Timorese asylum-seekers.

Portugal and Indonesia are mulling a new proposal; to settle their dispute over East Timor.

who to break the long-standing deadlock over East Timor.

Assailants killed three soldiers and a civilian.



Fig. 3. Summaries generated for the documents in the collection about East Timor 
6 Keyword-Based Headline Extraction

After selecting the most relevant sentences from a document, at the beginning of the 
process, we can follow a completely different approach for headline generation which 
consists in extracting keywords about the topic discussed in the document. These head- 
lines will not be grammatical, but they may also be informative for a reader. 

In our experiments, the keyword-based headlines have been generated in six ways: 

1. By collecting the highest frequency words in the document.

2. By collecting the highest frequency words in the collection.

3. By alternating high frequency words from the document and its collection, i.e. we
start with the highest-frequency word in the collection, next the highest-frequency
word in the document, next the second-highest-frequency word in the collection, etc.

4. By collecting the words with the highest weights in the document. 

5. By collecting the words with the highest weights in the collection. 

6. By alternating the highest weight words from the document and from its collection. 
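
The alternating strategies (settings 3 and 6) can be sketched as follows. The two input rankings may come from frequencies or from topic-signature weights. The stopping rule is an assumption made for this sketch (keywords are added while the comma-separated headline fits in 75 bytes, and repeated words are skipped), and the function names are hypothetical.

```python
from collections import Counter
from itertools import zip_longest


def ranked_words(tokens):
    """Words sorted by decreasing frequency (ties keep first-seen order)."""
    return [word for word, _ in Counter(tokens).most_common()]


def alternate_keywords(collection_rank, document_rank, limit=75):
    """Interleave two rankings, starting with the collection, up to `limit` bytes."""
    headline = []
    for pair in zip_longest(collection_rank, document_rank):
        for word in pair:
            if word is None or word in headline:
                continue  # one ranking is exhausted, or the word is already used
            candidate = ", ".join(headline + [word])
            if len(candidate.encode("utf-8")) > limit:
                return ", ".join(headline)
            headline.append(word)
    return ", ".join(headline)


print(alternate_keywords(["timor", "east", "indonesian"],
                         ["troops", "timor", "jakarta"]))
```

Settings 1, 2, 4 and 5 correspond to passing an empty list for one of the two rankings.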

Table 5 shows the results obtained with each approach. As can be seen, using just 
the collections is not very useful, because all the documents from the same collection 
will have the same headline. As expected, the best results have been obtained with a 
combination of the words with the highest frequencies or weights from the document and 
its collection. In general, we can see that using weights is better than using frequencies. 
Finally, Figure 4 shows some headlines obtained for the collection about East Timor. 
From them, a reader can guess the topic of the document, but it is still difficult to 
grasp the main idea. However, the ROUGE-1 score is surprisingly high (approaching
0.30).



Table 5. ROUGE-1 score for the several keyword selection strategies

Setting no.    Doc.    Col.    ROUGE-1 score
1              freq    -       0.25997
2              -       freq    0.21689
3              freq    freq    0.27255
4              wei.    -       0.26810
5              -       wei.    0.20724
6              wei.    wei.    0.29643




pro-independence, troops, Portuguese, timorese, Indonesian, timor, east, carrascalao 
territory, timorese, Portuguese, autonomy, indonesian, timor, east, document 
affair, timorese, portugal, timor, indonesian, Portuguese, east, extradite 
marker, Jakarta, pro-independence, timorese, Portuguese, be, east, indonesian 
timorese, protester, Portuguese, activist, indonesian, timor, east, habibie 



Fig. 4. Keyword-based summaries generated for the first five documents in the collection about 
East Timor 
timorese, east, timor, Indonesia's National Commission will investigate accusations.

timor, the documents show a total, of Indonesian troops assigned the number.

portuguese, east, timor, to extradite former President Suharto; Suharto be extradited.

indonesian, east, timor, he, said Yacob Hamzah, a lawyer; is a Muslim province.

group, Portuguese, activist, indonesian, east, timor, xanana, Habibie put an end.

timor, Rebels were holding two soldiers; Three soldiers and one activist were killed.

pro-independence, indonesian, Jakarta does not let six East Timorese asylum-seekers.

marker, Portugal and Indonesia are mulling a new proposal; to settle their dispute over East Timor.

have, say, indonesian, who to break the long-standing deadlock over East Timor.

Portuguese, east, timor, indonesian, Assailants killed three soldiers and a civilian.



Fig. 5. Summaries generated for the documents in the collection about East Timor 



7 Mixed Approach 

We have seen that a keyword-based approach produces headlines that are difficult to read
but highly informative, as they receive very high ROUGE-1 scores. On the other
hand, using the verb phrases produces grammatical headlines, but it is difficult to place
the contents of the headline in context, as most of the time there are no topical keywords.
A mixed approach can combine the strongest points of both.

Our final approach consists of generating the headlines from the verb phrases of the 
documents, weighted with the weight function. Most of these summaries have far 
less than 75 bytes, so we can complete them with other information in the following 
way: 

- While the length of the summary is lower than 75 bytes,

  • Add the next keyword, taken alternately from the document and the collection.

Furthermore, to check the impact of the keywords, we always add at least one keyword
to the VP-based summary. When tested on the DUC-2003 data, this configuration attains
a ROUGE-1 score of 0.28270, which is a large improvement over the highest score
obtained by the verb phrases alone (0.20105). It is lower than the best mark obtained
with only keywords, but the headlines are easier to read, as a large part of the headline
is formed with complete sentences. The summaries obtained for the collection on East
Timor are shown in Figure 5.
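
The padding step of the mixed approach can be sketched as follows. This is a minimal illustration with assumptions that the text does not fix: keywords are placed before the verb-phrase headline and separated by commas (as in Figure 5), the document ranking is consulted first, and the function name is hypothetical.

```python
def pad_with_keywords(vp_headline, doc_keywords, col_keywords, limit=75):
    """Prepend keywords, alternating document and collection, up to `limit` bytes.

    At least one keyword is always added, even if the verb-phrase headline
    already reaches the limit on its own.
    """
    queues = [list(doc_keywords), list(col_keywords)]
    keywords = []
    turn = 0

    def current():
        return ", ".join(keywords + [vp_headline])

    while any(queues) and (not keywords or len(current().encode("utf-8")) < limit):
        queue = queues[turn % 2]  # alternate between the two rankings
        turn += 1
        if queue:
            word = queue.pop(0)
            if word not in keywords:
                keywords.append(word)
    return current()


print(pad_with_keywords("Habibie put an end.",
                        ["group", "activist"], ["timor", "east"]))
```

As with the verb phrases, the last keyword may take the headline slightly over the limit, since the length is checked before each addition.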

8 Conclusions and Future Work 

We have developed a method for headline generation that combines a verb-phrase
extraction phase and a keyword-extraction procedure. In general, we have observed that
the keywords can be very useful for identifying the topic of the text. In fact, the addition
of a few keywords boosts the ROUGE-1 score from around 0.20 up to around 0.28. On
the other hand, the verb phrases not only usually provide the main idea of the document,
but also give the headline a more natural and readable shape than headlines formed of
just keywords.

Another conclusion that we can draw from the results is that it is equally important
to study both the separate documents alone, and the documents inside their collection.
A combination of the topic signatures from the documents and from their collections is
the one that has produced the best results. It can be argued that having the documents
organised in collections is not natural. However, if we have a single document, this
problem can in theory be overcome by automatically downloading similar documents
from the Internet, or by clustering documents to form the collections automatically. Future
work includes a deeper understanding of the meaning of ROUGE-1 and other possible
metrics for evaluating automatic summaries.
Abstract. This work deals with the derivational suffixes, endings and prefixes of the
Spanish language, which are used to establish about 70000 suffixal
and 11000 prefixal extended morpholexical relationships deduced from a
corpus of 134109 canonical forms. A computational tool is developed that is capable
of resolving and answering queries about any morphological aspect of a Spanish word. The
tool encompasses everything related to derivation, prefixation and other
nearby aspects. It allows the recognition, generation and manipulation
of the morpholexical relationships of any word and of its related words, including
the recovery of all its lexicogenetic information back to a primitive,
the management and control of the affixes in the treatment of its relationships,
as well as the regularity of the established relationship.



1 Introduction 

This work aims at obtaining a set of extended morpholexical relationships between
Spanish words that is useful for automatic applications in natural language processing. In a
synchronic study with automation in mind, formal or theoretical aspects may not
coincide with strictly linguistic ones. There are Spanish words that maintain a strong
functional and semantic relationship (the same that appears at the derivational or
prefixal level) but that cannot be taken as derivation or prefixation, although there is a
formal relationship through other stages in the evolution of the languages; it is
therefore considered necessary to include them: agua with acuoso, vejiga with vesical,
conejo with cunicular. This concept must be restricted to avoid arriving at the
concept of related idea, which exceeds the objectives of this work (blanco with album,
sólido with endurecer, niño with pueril); therefore, a historic-etymological meeting
criterion is applied. It is obvious that for the speaker acuario, portuario and
campanario are all places equally related to agua, puerto and campana, and it must also
be so for automatic data processing. In order to overcome the linguistic boundaries
that prevent treating relationships beyond strict derivation or prefixation, it is
necessary to work at a level different from the morphological one; thus the concept
of morpholexical relationship is extended.

J. L. Vicedo et al. (Eds.): EsTAL 2004, LNAI 3230, pp. 407-418, 2004.
2 Lexicon

The corpus handled in this work has been created from: the Diccionario de la Lengua
Española (DRAE), the Diccionario General de la Lengua Española (VOX), the
Diccionario de Uso del Español (María Moliner), the Gran Diccionario de la Lengua
Española (LAROUSSE), the Diccionario de Uso del Español Actual (Clave SM), the
Diccionario de Sinónimos y Antónimos (Espasa Calpe), the Diccionario Ideológico de
la Lengua Española (Julio Casares) and the Diccionario de Voces de Uso Actual
(Manuel Alvar Ezquerra).

A canonical form is defined as any word with its own identity that is capable of
undergoing derivational processes to form other words. Such a word may itself have been
formed from another by similar processes. In the reference corpus, a canonical form is any
entry word of the consulted sources that has its own meaning; entries that are appreciative
forms of others and do not add any substantial variation in meaning are discarded. The
universe of words analyzed in this work is composed of 148798 canonical forms.

3 Suffixal Relationships 

Spanish derivational processes consist fundamentally of suffixal modifications, which
usually, but not always, imply a change in the grammatical category of the derivative
with respect to its primitive canonical form.

Frequently, there are pairs of primitive forms in Spanish that come from the same
mother tongue, in which they went through a derivational process. The current
relationship between the members of such pairs is considered a suffixal alteration, since
in the current state of the Spanish language the relationship between them
presents a strong parallelism with the derivational processes between Spanish forms,
in its morphological aspects as well as in its semantic and grammatical aspects. For
the verb multiplicar and the adjective multiplicable (both Spanish primitive words
derived directly from Latin), it is advisable to consider multiplicable a
deverbal adjectivization of multiplicar.

Many Spanish words come from the same mother tongue, where
they underwent a derivational process from a common element that was never
consolidated in current Spanish. They are thus etymologically related and are considered
derivatives of a non-existent Spanish form, since they show a high similarity in
morphological, semantic and grammatical aspects to analogous Spanish derivatives
of an existing form. Between the forms concupiscente and concupiscible (both
Spanish primitive words derived directly from Latin), it is feasible to consider
a relationship analogous to that between any pair of Spanish deverbal
adjectivizations formed from a common primitive with the same endings, like dirigente and
dirigible from dirigir.

Another important aspect of the formation of Spanish words is to decide which
words must be considered primitives and which derivatives; the chronological order of
appearance of the words must be maintained. The main difficulty appears when the
diachronic formation of two morpholexically related primitives is ambiguous.
In such a case, and for the purpose of a synchronic treatment of Spanish, the
verb, if any, is designated as the primitive; when no verb is involved, the primitive
is selected using the Spanish word formation rules, supposing that the words are
morpholexically related on equal terms. For example, lamento is considered a deverbal
nominalization of lamentar, in the knowledge that both are primitive.

It is necessary to emphasize that there are Spanish words with a close functional
and semantic relationship that cannot be directly established through a process
classified within derivational morphology. So, the concept of extended morpholexical
relationship is incorporated. It encompasses, in addition to the relationships
produced by derivation, others characterized by having a meeting point in their
etymological record and by incorporating an ending with the adequate semantic and
functional contribution: audición is considered suffixally related to oír, although
audición is primitive and possesses a different root from oír.

The extended morpholexical relationship also relates words through an ending
that graphically coincides with a suffix, provided it has the semantics and functionality
corresponding to that suffix. That is the relationship between miel and melaza
('sediments of honey'), whose ending coincides with the suffix -azo/a in spite of the
different formative process, or that of seise with seis, through the ending -e, although
seise is a regressive form of the plural seises; in short, it is a noun closely related
to the adjective seis.

The extended relationship also includes those cases in which the only impediment
to establishing it is some characters of the ending: between diez and décimo, by
analogy with the relationship between mil and milésimo, among others, through the
numerical ending -ésimo.

Although both inflection and appreciative derivation constitute suffixal
morphological processes of interest for the development of automatic natural
language processors, neither of these aspects is dealt with here, since they have been
addressed in FLANOM*.

A suffix is “a phonic sequence added to the end of a word stem, prior to the
desinences, if any, that lacks self existence outside the word system, cannot be
joined to another morpheme to form a derivative, is interchangeable with other
suffixes and can be added to other stems” [2]. Based on this definition, the
set of suffixes considered in this work to establish relationships can be enumerated:
a sequence is considered a suffix if it appears in three or more different words. Thus,
endings like -arra, -aste, -ello, -ingo, -uz fulfil the described suffix definition,
although their suffixal status in the grammatical sense is questionable.

An original word can be any Spanish word with self identity that admits the addition
of suffixes to obtain another related word. Deverbal regressions with addition of the endings
-a, -e, -o and the zero-suffix or empty suffix, as well as plurals and appreciative suffixes
with consolidated meaning, are also considered in this work. Suffixal elements
(suffixes with a strong semantic load on the original word or those possessing self
existence) are left for another work, since they are part of the research on composition.

* FLANOM: Flexionador y Lematizador Automático de Formas Nominales. Developed by the
Grupo de Estructuras de Datos y Lingüística Computacional of the University of Las Palmas
de Gran Canaria, available on the Internet.


Even though it is true that most of the relationships between words coincide with a
formal derivation (mainly the regular ones), the spelling coincidence with a concrete
suffix can lead to relationships that are detached from that linguistic concept, at least by
means of that suffix, but that meet the proposed objective. For example, the words
psiquiatra and psiquiátrico are related to psiquiatría through the endings -a and -'ico;
in spite of being aware of the existence of -iatría and -iatra, the relationship is
intended to be established between the previously mentioned words and not with a possible
root psiqui-.

A related word can also be original to another by adding a new suffix:
quej-a → quej-ica → quejic-oso. The arrow must not be interpreted as derivation
in a strict sense, but as an extended morpholexical relationship between two words:
original and related.
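
Following such a chain back to a primitive can be sketched as below. The dictionary `ORIGINAL_OF` is a hypothetical store introduced only for illustration (it is not the data structure of the tool described here); it maps each related word to its original word and the linking ending.

```python
# Hypothetical store: related word -> (original word, ending that links them).
ORIGINAL_OF = {
    "quejica": ("queja", "-ico/a"),
    "quejicoso": ("quejica", "-oso/a"),
}


def chain_to_primitive(word, original_of=ORIGINAL_OF):
    """Follow original-word links until a word with no recorded original is reached."""
    chain = [word]
    while chain[-1] in original_of:
        original, _ending = original_of[chain[-1]]
        if original in chain:  # guard against accidental cycles in the data
            break
        chain.append(original)
    return chain


print(chain_to_primitive("quejicoso"))
```

The last element of the returned list is the word treated as primitive; reversing the list gives the chain in the direction of the arrows.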

Changes affecting the root, such as relationships with erudite words or foreign language
influences, among others, are considered irregularities. Although they are not pure
linguistic derivation, they are included because they notably enrich computer applications.
The possible suffix combinations that lead to a relationship are also considered irregularities,
as long as the previously explained continuity criterion cannot be established because
the intermediate form is not consolidated or does not exist in the corpus.
Appearances of interfixes and infixes are considered irregularities.

Now the alphabetical list of studied suffixes is presented together with their
corresponding number of extended morpholexical relationships: -a (1419), -áceo/a (190),
-acho/a (35), -aco/a (27), -'ada, -'ade (9), -'ago/a (11), -aino/a (16), -aje (315), -ajo/a
(65), -al (1574), -ales (6), -allo/a, -alle (36), -amen (34), -án/a (38), -áneo/a (18),
-ano/a, -iano/a (732), -año/a (36), -anza, -enza (127), verbalizer -ar (2635), -ar (407),
-ario/a (532), -astro/a, -astre (27), -atario/a (36), -ate (33), -átil (16), -ato/a (294),
-avo/a (29), -az (22), -azgo/a, -adgo/a (100), -azo/a (562), -azón (75), -bilidad (263),
-able, -ible (1088), -abundo, -ebundo, -ibundo (13), -ación, -ición (2637), -'culo/a,
-áculo/a, -ículo/a (41), -dad, -edad, -idad (1095), -adero/a, -edero/a, -idero/a (650),
-adizo/a, -edizo/a, -idizo/a (118), -ado/a, -ido/a (16320), -ador/a, -edor/a, -idor/a
(3034), -edumbre, -idumbre (19), -adura, -edura, -idura (789), -aduría, -eduría, -iduría
(42), -e (542), -ear (1786), -ecer (102), -eco/a (32), -edo/a (62), -ego/a, -iego/a (68),
-ejo/a (86), -el (57), -elo/a (63), -en (17), -enco/a, -engo/a (30), -eno/a (84), -eño/a
(381), -ense, -iense (417), -enta (16), -ento/a, -iento/a, -ulento/a, -olento/a (125), -eo/a
(152), -'eo/a (128), -er, -ier (29), -ería (881), -erio (24), -erío (55), -erizo/a (23), -erno/a
(12), -ero/a (3151), -és/a (207), -esa (29), -ésimo/a (36), -ete, -eto/a (394), -euta (14),
-ez/a (374), -ezno/a (13), -grama (51), -i (49), -íaco/a, -iaco/a (72), -icio/a, -icie (78),
-ico/a (35), -'ico/a (1690), -'ide (18), -'ido/a (156), -ificar (135), -'igo/a (18), -iguar (4),
-ijo/a (59), -'il (16), -il (155), -illo/a (797), -imonio/a (6), -ín/a (241), -ina (174), -íneo/a
(26), -ing (15), -ingo/a (13), -ino/a (526), -ino/a (16), -'io/a (499), -ío/a (629), -ión
(279), -ir (25), -is (9), -ismo (1365), -ista (1325), -ístico/a (122), -ita (66), -ita (83),
-ito/a (258), -'ito/a (18), -itud (57), -ivo/a, -ativo/a, -itivo/a (692), -izar (554), -izo/a
(95), -ma, -ema (20), -ambre, -imbre, -umbre (23), -mente (2432), -amento, -imento,
-amiento, -imiento (1876), -ancia, -encia (532), -andero/a, -endero/a (28), -ando/a,
-endo/a, -iendo/a, -ondo/a, -iondo/a (87), -ante, -ente, -iente (1617), -o (2740), -ol (17),
-olo/a, -ol (73), -ón/a (1334), -ongo/a (15), -or/a (286), -'ora (4), -orio/a (89), -oso/a
(1286), -ote/a, -oto (125), -arro/a, -orro/a, -arrio/a, -orrio/a (55), -s (46), -asco/a,
-esco/a, -isco/a, -usco/a (213), -tano/a, -itano/a, -etano/a (44), -aticio/a, -iticio/a (15),
-'tico/a, -ático/a, -ético/a, -ítico/a, -ótico/a (382), -atorio/a, -itorio/a (298), -triz (28),
-ucho/a (23), -uco/a (17), -udo/a (238), -uelo/a (134), -ujo/a (17), -'ulo/a (45), -uno/a
(60), -ura (202), -uro (22), -uto/a (33) and other lower frequency endings.

4 Prefixal Relationships 

Prefixation does not produce a change of grammatical category; normally it shades,
amends or modifies the meaning of the word; in short, it guides it. In addition to
prefixation through traditional elements, there is a certain kind of composition through
prefixal elements, among other forms of composition; these elements will not be taken into
account in this work because of the strong semantic load they provide.

A prefixal morpholexical relationship between a pair of primitive forms is
established when, formally, one could be obtained by prefixation from the other and it
shows the semantic relationship inherent in the prefix: culpar and inculpar come
directly from the Latin culpare and inculpare, but one can be obtained from the other by
adding the prefix in-.

Independently of the existing suffixal morpholexical relationships, a prefixal
relationship between two original words is projected onto the corresponding derived pairs:
the prefixal relationship between amortizar and desamortizar is applied to
amortización with desamortización and to amortizable with desamortizable.

Apart from the prefixal morpholexical relationships, the suffixal morpholexical
relationships of a word are projected onto its corresponding prefixed forms, since their
mutual morphological, semantic and grammatical relationships are equivalent:
sobrecalentamiento and sobrecalentar are prefixed forms between which the same
suffixal morpholexical relationship that exists between calentamiento and calentar is
established.
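
The projection of a prefixal relationship onto derived pairs can be sketched as follows. This is an illustration under simplifying assumptions: prefixation is regular (the prefixed word is the prefix followed by the unchanged original word), a projected pair is kept only if the prefixed derivative is attested in the lexicon, and the function name and data layout are hypothetical.

```python
def project_prefixal(prefixal_pair, suffixal_relations, lexicon):
    """Project (original, prefixed) onto the suffixal derivatives of `original`.

    suffixal_relations maps an original word to its suffixally related words;
    lexicon is the set of attested canonical forms.
    """
    original, prefixed = prefixal_pair
    if not prefixed.endswith(original):
        return []  # irregular prefixation is outside this sketch
    prefix = prefixed[: len(prefixed) - len(original)]
    projected = []
    for derived in suffixal_relations.get(original, []):
        candidate = prefix + derived
        if candidate in lexicon:
            projected.append((derived, candidate))
    return projected


relations = {"amortizar": ["amortización", "amortizable"]}
lexicon = {"desamortizar", "desamortización", "desamortizable"}
print(project_prefixal(("amortizar", "desamortizar"), relations, lexicon))
```

The symmetric case, projecting a suffixal relationship onto prefixed forms (calentar and calentamiento onto sobrecalentar and sobrecalentamiento), can be handled in the same way by prepending the prefix to both members of the suffixal pair.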

The same considerations as in the suffixal alterations are applied to extend the
concept of prefixal morpholexical relationship: coepíscopo is considered to be
prefixally related to obispo, though coepíscopo is primitive and, from a synchronic
perspective, possesses a different base from that of obispo.

The extended morpholexical relationship also relates words through a beginning
that graphically coincides with a prefix, provided it has the semantics and functionality
corresponding to that prefix: a relationship is established between eficiente and deficiente,
because the beginning of the second word coincides with the prefix de- and the pair
satisfies the remaining criteria.

Other graphic elements providing some type of relationship between words are also
studied. Such is the case of the article of Arabic origin al-, which makes it possible to
establish an extended morpholexical relationship between juba and aljuba as variants of one another.

The relationship between an original word and a prefixed form is semantic,
functional and formal. In the semantic aspect, the prefix incorporates its own specific
nuance, but the main semantic load corresponds to the original word. Syntactic and
grammatical functions tend to be maintained in the prefixed form, though occasionally a
change of grammatical category may occur, due more to use than to the prefixational
process. The formal differences follow the general rules of prefixation and
those specific to each prefix and, though irregularities exist, they have a
much lower incidence than alterations in other positions of the lexical base.

A prefixal element, henceforth prefix, can be defined as an affix attached to the front of
the original word. This definition is so broad that it raises two problems when delimiting
which prefixes must be considered in this work without entering into composition.

Some prefixes have their own functional identity in Spanish (generally prepositions
and some adverbs); they are called vulgar prefixes, and some authors consider
them compositional elements. However, a prefix coming from a preposition can be
detached from it, since it does not fulfil the same grammatical function, though
sometimes they agree in meaning. The semantic contribution of a preposition is a
nominal subordination and that of a prefix is a semantic addition; the similarity
between them is reduced at present, and in a synchronic study, to a phonetic issue. In
this work, all prefixes of this type are treated in a similar way (each of them is
accompanied by the cardinality of its extended morpholexical relationships): a- (615),
ab- (29), ad- (52), al- (58), ana- (14), ante- (106), con- (516), contra- (313), cuasi-
(6), de- (287), des- (1815), e- (50), en- (228), entre- (136), ex- (85), extra- (64), in-
(1317), para- (61), por- (3), post- (89), pro- (102), sin- (28), so- (91), sobre- (254),
sota- (14), sub- (249), ultra- (64).

The second issue is which erudite prefixes, prefixoids, prefixal elements or
compositive elements should be included. We have opted for a generous selection:
those that change the meaning of the term to which they are joined,
in an objective or subjective way, and those that give a pronominal or adverbial
sense to the base. They do not appear as independent terms, and their semantic value
is generic and applicable to any grammatical category. The prefixes of this type are:
abiso- (2), aero- (4), ambi- (4), anfi- (6), auto- (185), bar- (7), bati- (5), bi- (111),
circa- (0), circun- (16), di- (87), diali- (3), ecto- (3), endo- (27), equi- (14), eu- (3),
exo- (18), hemi- (7), hetero- (32), hiper- (97), hipo- (72), infra- (21), iso- (28),
macro- (44), maxi- (3), mega- (24), meso- (15), meta- (33), micro- (116), mini- (58),
mono- (64), multi- (54), omni- (10), opisto- (1), pan- (44), pen- (4), pluri- (11), plus-
(14), poli- (49), preter- (5), proto- (23), retro- (33), semi- (159), super- (220), supra-
(21), tele- (70), uni- (23), vice- (42), yuxta- (3). Erudite prefixes whose semantic
contribution is strong are discarded, such as bio- (life), foto- (light) and metro- (measure),
among others, as are apocopes acting as prefixal elements, such as auto- (from automóvil)
and tele- (from televisión), among others.

Of course, prefixes that the employed sources define as such are considered too: 
anti- (416), apo- (12), archi- (44), cachi- (10), cata- (8), cis- (6), citra- (1), dia- (21), 
dis- (86), epi- (36), es- (82), inter- (161), intra- (35), ob- (20), per- (74), peri- (20), 
pre- (243), re- (995), requete- (3), res- (9), tatara- (3), trans- (258), za- (7). 

A related word can in turn be the original word of another through the addition of a new
prefix: emitir → transmitir → retransmitir. The arrow must not be interpreted as prefixation
in a strict sense, but as an extended morpholexical relationship between two
words: original and related.

[Figure 1: directed graph of the clan of permear. Nodes, from the root downwards:
permear (verb); permeancia (substantive), permeable (adjective), permeado (adjective);
semipermeable (adjective), permeabilidad (substantive), impermeable (adjective),
permeabilizar (verb); impermeabilidad (substantive), impermeabilizar (verb),
permeabilizado (adjective); impermeabilización (substantive), impermeabilizante
(adjective), impermeabilizado (adjective). Legend: morphological relationship
(suffixal, prefixal); grammatical relationship (without grammatical category change,
from verb to substantive, from verb to adjective, from adjective to substantive, from
adjective to verb).]

Fig. 1. Extended morpholexical relationships clan of permear



Changes affecting the root, such as relationships with erudite words or influences from
foreign languages, among others, are treated as irregularities; although they are not purely
linguistic prefixations, they are included because they notably enrich computer applications.
Prefix combinations that give rise to a relationship are also treated as irregularities when
the previously explained continuity criterion cannot be established because the
intermediate form is not consolidated or does not exist in the corpus.



5 Extended Morpholexical Relationships Organization 

The set formed by an original word and all its morpholexically related words is
called a family. Since a word can be related to an original word and, at the same
time, be the original word in relationships with other words, kinship relationships
between different families are established through this word. All the families
related in this way compose a clan.

To represent the different types of relationships produced by the rules of Spanish word
formation and by the extended criteria applied, a directed graph has been
chosen: nodes identify the Spanish words, edges express the existence of extended
morpholexical relationships, the direction of each edge determines the relationship
between the nodes, and the edge labels classify the type of extended morpholexical
relationship. Spanish words are thus grouped into disjoint sets of mutually related
elements (the connected components of the graph, or clans), as in Figure 1.



6 Navigation on the Extended Morpholexical Relationship Graph

The graph can be traversed in any direction: from one particular node, every other
node in the same connected component can be reached, and the extended
morpholexical relationships (the edges crossed) are known at every step until the
destination is reached. Depending on the direction chosen (ascending, horizontal or
descending), the morphology of the words and the distance between them are categorized.
From one word, it is possible to obtain those that have undergone fewer formative
processes (ascending), those that have undergone the same number of alterations
(horizontal) and those that have undergone more formative processes (descending).
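
As an illustration of this organization, the following minimal Python sketch stores the relationships as labelled directed edges and recovers a clan as a connected component. It is not the implementation of the system described here: the names EDGES, build_graph and clan_of are hypothetical, the edge labels are reduced to "suffixal" and "prefixal", and the attachment of the lower-level words of the permear clan to their originals is an assumption read off the examples in the text.

```python
from collections import defaultdict

# (original word, related word, relationship type). Illustrative data only:
# the third-level attachments are assumptions consistent with the examples.
EDGES = [
    ("permear", "permeancia", "suffixal"),
    ("permear", "permeable", "suffixal"),
    ("permear", "permeado", "suffixal"),
    ("permeable", "semipermeable", "prefixal"),
    ("permeable", "permeabilidad", "suffixal"),
    ("permeable", "impermeable", "prefixal"),
    ("permeable", "permeabilizar", "suffixal"),
    ("impermeable", "impermeabilidad", "suffixal"),
    ("impermeable", "impermeabilizar", "suffixal"),
    ("permeabilizar", "permeabilizado", "suffixal"),
    ("impermeabilizar", "impermeabilización", "suffixal"),
    ("impermeabilizar", "impermeabilizante", "suffixal"),
    ("impermeabilizar", "impermeabilizado", "suffixal"),
    ("tú", "tutear", "suffixal"),  # a second, unrelated clan
]


def build_graph(edges):
    """Index the labelled edges in both directions."""
    children = defaultdict(list)  # original -> [(related, label)]
    parents = defaultdict(list)   # related -> [(original, label)]
    for original, related, label in edges:
        children[original].append((related, label))
        parents[related].append((original, label))
    return children, parents


def clan_of(word, children, parents):
    """Return the connected component of `word`, ignoring edge direction."""
    seen, stack = {word}, [word]
    while stack:
        node = stack.pop()
        neighbours = [w for w, _ in children.get(node, [])]
        neighbours += [w for w, _ in parents.get(node, [])]
        for nxt in neighbours:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

With these data, clan_of("impermeabilizante", ...) collects the fourteen words of the permear clan and leaves tú and tutear in a clan of their own.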

6.1 Derivation 

Derivation is the process by which words are formed, or are related through the
extended criterion, by altering the structure or meaning of other words (the originals),
generally with a different grammatical category, though they may also be obtained from
words with the same function. Descending the graph one level through suffixal
edges yields the derivatives of the desired grammatical category. Substantive,
adjective, verbal and adverbial derivation are supported.

6.2 Direct Ancestry 

Direct ancestry yields the original word to which a given word is related (the inverse
of derivation or prefixation). Ancestors are found by ascending the graph
one level. Thus, the direct ancestor of the verb tutear is the personal pronoun tú, that of
the verb preconcebir is the verb concebir and, in the clan of permear, the direct ancestor of
the substantive permeabilidad is the adjective permeable. If the direct ancestry process
is applied twice, the original of the original of the current node is obtained: the
direct ancestor at two levels of the verb permeabilizar is permear. The verb tutear, on the
other hand, does not have this option because the pronoun tú is the root of the graph.

6.3 Indirect Ancestry 

Indirect ancestry retrieves the words that are morpholexically related to the direct
ancestors and are found at the same level of the graph as those ancestors. Words of the
clan that have undergone one alteration fewer than the current word can be obtained in this
way. In the clan of permear, the indirect ancestors of the adjective impermeable are the
adjective permeado and the substantive permeancia. As with direct ancestry, it is possible
to navigate further levels of morpholexical relationships: the second-level indirect
ancestors of permeabilizado are the substantive permeancia and the adjective permeado, the
same result as one level gives for permeabilidad.

6.4 Horizontality 

Words morpholexically related to the same original word constitute the horizontal
direction: they are words with the same number of alterations. They are obtained by
retrieving the direct ancestor and descending one level through all the edges of that
node. This option retrieves all the other members of a family from one of them; it does not
include the original word. From the adjective impermeabilizante, the adjective
impermeabilizado and the substantive impermeabilización are obtained.

6.5 Descendants

The descendants of a word are the members of the family for which it is the
original word. The level-two descendants are the descendant families of each
member of that family. In the clan of permear, the descendants
of the adjective permeable are the substantive permeabilidad, the verb permeabilizar
and the adjectives impermeable and semipermeable. The level-two descendants
of the adjective permeable are the adjective permeabilizado, the substantive
impermeabilidad and the verb impermeabilizar.



7 Filters 

Filters on the extended morpholexical relationships allow the navigation results to be
narrowed selectively. The results of the different types of navigation can be
passed through filters of several kinds: functional, by regularity and by affixes.

7.1 Functional 

The functional filter selects by grammatical category. Thus, in the clan of
permear, the only substantive descendant of the adjective permeable is
permeabilidad.

7.2 Regularity 

The regularity filter selects either the regular or the irregular morpholexical
relationships that the words maintain with respect to the original word. To
explore the irregular horizontal relationships of a word, horizontal
navigation and the irregular filter are applied together.

7.3 Affixal 

The affixal filter selects according to the affixes that establish the morpholexical
relationships; discrimination by one or more affixes can be applied.
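
The three filters can be pictured as independent predicates applied to a navigation result. The Python sketch below is an illustration under stated assumptions: the record type Related, the function apply_filters, the affix labels and the regularity flags are all hypothetical and are not taken from the system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Related:
    word: str
    category: str  # e.g. "substantive", "adjective", "verb"
    affix: str     # affix that establishes the relationship (assumed label)
    regular: bool  # True for a regular relationship (assumed flag)


# Descendants of permeable, as listed in Section 6.5; labels are illustrative.
DESCENDANTS_OF_PERMEABLE = [
    Related("permeabilidad", "substantive", "-bilidad", True),
    Related("permeabilizar", "verb", "-izar", True),
    Related("impermeable", "adjective", "in-", True),
    Related("semipermeable", "adjective", "semi-", True),
]


def apply_filters(results, categories=None, regular=None, affixes=None):
    """Keep the words that pass every filter that is set (None = no filter)."""
    kept = []
    for r in results:
        if categories is not None and r.category not in categories:
            continue
        if regular is not None and r.regular != regular:
            continue
        if affixes is not None and r.affix not in affixes:
            continue
        kept.append(r.word)
    return kept
```

Applying the functional filter with categories={"substantive"} to these descendants leaves only permeabilidad, which is the example given in Section 7.1.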

8 Application 

The application interprets and handles with versatility the most relevant aspects of the
extended morpholexical relationships. It is one way of demonstrating the power of the
system, without prejudice to its integration into other useful natural language processing
tools. The interface of RELACIONES, Figure 2, facilitates exploration of the extended
morpholexical relationships of a canonical form of any Spanish word. The Entrada box
lets the user enter any word. As a result of recognition, only canonical
forms that possess extended morpholexical relationships are placed in the Forma
canónica combo box, from which their relationships can be requested.

Fig. 2. Extended morpholexical relationships interface

A set of check boxes permits navigation through the clan of related words while filtering
the results. The related words are shown in the list boxes on the right of
the interface, which are organized by grammatical category.

Check boxes grouped under Regularidad filter the response according to
regularity: regular relationships, irregular relationships or both.

The Dirección and Profundidad check boxes establish which words morphologically
related to the canonical form are shown. These two groups of check boxes are linked,
since setting the search direction on the graph requires specifying its
depth level. The result shows the union of all related words for each of the
indicated directions at each of the depth levels chosen, classified in list
boxes by grammatical category.

Three buttons appear under each grammatical category list box: Todos, Puros
and Mixtos. They select the words that have only the grammatical
function of the list (Puros), those that have another grammatical function in addition to
the one defined by the list (Mixtos), or all words that have at least the
grammatical function defined by the list (Todos).

Fig. 3. The interface of extended morpholexical relationships: Suffix



The Sufijos tab sheet of the RELACIONES interface, Figure 3, is used to configure the
suffixal filter. The suffixes considered are shown classified by the grammatical
category that they produce and by their frequency of appearance. Each group is
ordered alphabetically to ease location. The Otros tab sheet collects the verbalizing
suffixes, adverbializing suffixes and other irregular suffixes that are not easily
classified; these are not grouped by frequency of appearance. The Prefijos tab sheet is
used to configure the prefixal filter; the prefixes are shown classified by their
frequency of appearance.



Abstract. This paper presents SuPor, an environment for extractive Automatic
Summarization (AS) of texts written in Brazilian Portuguese (BP), which can be explored
by an AS specialist to select promising features for extraction. By
combining any number of features, SuPor allows one to investigate the
performance of distinct AS systems and identify which groups of features are
more adequate for Brazilian Portuguese. One of its systems has outperformed
six other extractive summarizers, signaling a significant grouping of features, as
shown in this paper.



1 Introduction 

The present work combines classical (e.g., [14]; [5]) and novel approaches (e.g., [2])
to AS of English texts, in order to investigate which features contribute best to
summarizing texts in BP. Specific BP resources were used, namely, electronic
dictionaries, a lexicon, and a thesaurus (see, for example, [20], [6], and [4]), together
with BP-driven tools, such as a parser [17], a part-of-speech tagger [1], a stemmer [3], and
a sentence segmenter. These allowed us to devise extractive AS systems to explore more
thoroughly the AS of texts in BP. So far, the texts under consideration are genre-specific.
Summarization strategies focus upon both linguistic and non-linguistic constraints, in
a multifaceted environment that allows the user to choose distinct summarization
features. Once customized, the environment, hereafter named SuPor (an environment
for automatic Summarization of texts in PORtuguese) [19], is ready to produce as
many extracts of the same input as the user wishes, at distinct compression
rates. By diversifying the groups of selected features, distinct AS strategies may be
considered, which makes it possible to analyze which groupings of features apply better
to the extraction of text units from texts in Brazilian Portuguese. Like other proposed
methodologies, SuPor aims at identifying text units that are sufficiently relevant to
compose an extract. Unlike them, it allows the user to set quite freely which
combination of features s/he intends to explore for AS. Henceforth, automatically
generated summaries are named extracts, after the extractive methodology, i.e., the
copy-and-paste of text units considered relevant to include in a summary [16].

In Section 2 we briefly describe the approaches for English that have been
incorporated into SuPor. The distinct ways of generating extracts are described in
Section 3. Preliminary evaluations, described in Section 4, indicate promising
combinations of features for summarizing texts in BP. Final remarks are presented in
Section 5.



2 Extractive Approaches Incorporated into SuPor

SuPor considers four extractive methods explored for the English language, hereafter
named the ‘Classifier Method’ [11], which combines corpus-based features; the ‘Lexical
Chains Method’ [2], which computes connectedness between words; the ‘Relationship
Map Method’ [29], which works similarly to the previous one but considers
paragraphs instead; and the ‘Importance of Topics Method’ [12], which identifies
important topics of a source text.

The Classifier Method uses a Bayesian classifier to train the system to recognize
relevant features. These include sentence length, with a minimum of 5 words;
word frequency; signaling nouns; sentence or paragraph location; and the
occurrence of proper nouns. The last three features were first addressed by
Edmundson [5]. The training phase produces a probabilistic distribution
that allows the automatic summarizer to select sentences through the features thus validated.
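
As an illustration of how such a probabilistic distribution can rank sentences, the sketch below implements a naive Bayes score in the spirit of Kupiec et al. [11], in which the probability that a sentence belongs to the summary is proportional to the prior times the product of the feature likelihoods, divided by the product of the feature probabilities. It is a sketch under assumptions, not SuPor's code: features are taken to be binary, the add-one smoothing is our own choice, and the function names are hypothetical.

```python
def train(sentences):
    """Estimate the distribution from (features, in_summary) pairs.

    `features` is a tuple of booleans, one per feature; `in_summary` tells
    whether the sentence belongs to the ideal extract. Add-one smoothing
    (an assumption of this sketch) avoids zero probabilities.
    """
    n = len(sentences)
    n_sum = sum(1 for _, y in sentences if y)
    k = len(sentences[0][0])
    prior = n_sum / n
    p_f_given_s = [
        (sum(1 for f, y in sentences if y and f[j]) + 1) / (n_sum + 2)
        for j in range(k)
    ]
    p_f = [
        (sum(1 for f, _ in sentences if f[j]) + 1) / (n + 2)
        for j in range(k)
    ]
    return prior, p_f_given_s, p_f


def score(features, model):
    """Naive Bayes score of a sentence; higher means more extract-worthy."""
    prior, p_f_given_s, p_f = model
    numerator, denominator = prior, 1.0
    for j, present in enumerate(features):
        numerator *= p_f_given_s[j] if present else 1 - p_f_given_s[j]
        denominator *= p_f[j] if present else 1 - p_f[j]
    return numerator / denominator
```

Sentences would then be sorted by score and the top ones kept until the chosen compression rate is reached.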

The Lexical Chains Method computes lexical cohesion through several maps of word
correlations, considering only nouns as the basic significant units of a source text.
Following [9, 7], strong lexical chains are those whose semantic relationship is more
expressive. To compute lexical chaining, an ontology and WordNet [18] are used to
identify the lexical chaining mechanisms of cohesion (synonymy/antonymy,
hypernymy/hyponymy, etc.), resulting in a set of strongly correlated words. Three
distinct summarization heuristics may be applied to select the sentences to include in an
extract based on lexical chaining. The 1st heuristic selects every sentence S of the
source text based upon each member M of every strong lexical chain of the set previously
computed: S is the sentence that contains the 1st occurrence of M. The 2nd heuristic
applies the former one only to representative members of a strong lexical chain. A
representative member of a lexical chain is one whose frequency is greater than the
average frequency of all the words in the chain. Finally, the 3rd heuristic is based upon
the representativeness of a given strong lexical chain in every topic of the source text.
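
The first two heuristics can be sketched in Python as follows, assuming that the strong chains have already been computed. The data representation (a chain as a mapping from member word to frequency, a sentence as a list of tokens) and the function names are hypothetical choices made for this illustration.

```python
def representative_members(chain):
    """Members whose frequency exceeds the average frequency in the chain.

    `chain` maps each member word to its frequency in the source text.
    """
    average = sum(chain.values()) / len(chain)
    return {word for word, freq in chain.items() if freq > average}


def first_occurrence(word, sentences):
    """Index of the first sentence (a list of tokens) containing `word`."""
    for index, tokens in enumerate(sentences):
        if word in tokens:
            return index
    return None


def select(members, sentences):
    """Sorted indices of the sentences holding the 1st occurrence of each member."""
    picked = {first_occurrence(m, sentences) for m in members}
    picked.discard(None)
    return sorted(picked)


def heuristic_1(strong_chains, sentences):
    """Use every member of every strong chain."""
    members = {m for chain in strong_chains for m in chain}
    return select(members, sentences)


def heuristic_2(strong_chains, sentences):
    """Use only the representative members of each strong chain."""
    members = {m for chain in strong_chains for m in representative_members(chain)}
    return select(members, sentences)
```

Because the number of selected sentences follows from the chains themselves, a compression rate cannot be imposed on either heuristic.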

The Relationship Map Method focuses on three distinct ways of interconnecting
paragraphs to build maps of correlated text units, yielding the following paths: the
dense or bushy, the deep, and the segmented ones. Dense paths are those with more
connections in the map. To build them, top-ranked paragraphs are chosen totally
independently of each other. Because of this, texture (i.e., cohesion and coherence) is
not guaranteed. To overcome this, the deep path focuses on paragraphs that are
semantically inter-related. However, the extract may then convey a single topic,
which may not be the main topic of the source text. The segmented bushy path aims at
overcoming the bottlenecks of the former two by addressing distinct topics of the
source text.
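
A minimal Python sketch of the dense (bushy) path is given below. It is only an illustration under assumptions: paragraphs are compared through the cosine similarity of raw word-count vectors, two paragraphs are linked when their similarity reaches a threshold (the value 0.2 is arbitrary), and the function names are hypothetical. No stopword removal or stemming is applied here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counter objects)."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def bushy_path(paragraphs, n, threshold=0.2):
    """Indices of the n most connected paragraphs, returned in text order."""
    vectors = [Counter(p.lower().split()) for p in paragraphs]
    links = [0] * len(vectors)
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                links[i] += 1
                links[j] += 1
    # Rank by number of links; ties are broken by position in the text.
    ranked = sorted(range(len(vectors)), key=lambda i: (-links[i], i))
    return sorted(ranked[:n])
```

Since each paragraph is ranked on its own link count, nothing in this selection ensures that consecutive paragraphs of the extract are related, which is the texture problem noted above.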

Finally, the Importance of Topics Method is based upon the so-called TF-ISF, or
Term Frequency-Inverse Sentence Frequency measure, which identifies sentences that
may uniquely convey relevant topics to include in the extract. Topics are delimited by
the Text Tiling algorithm [8].

Besides the variations among the methods themselves, the following differences are
observed: (a) the Classifier is the only one that depends on genre and domain, because it
requires training on corpora of texts; (b) the Lexical Chains Method is the only one that
does not allow a compression rate to be specified, because it is the heuristic, and not the
user, that determines the number of extract sentences; (c) the Relationship Map Method is
the only one that deals with paragraphs, instead of sentences, as minimal units.
Preprocessing applies to all of them, aiming at producing the internal
representation of the source texts. However, distinct tools are used, as follows (the
methods that embed them are given in parentheses): stopword removal (all but Lexical
Chaining), stemming (Relationship Map and Importance of Topics), segmenting (all), tagging
and parsing (only Lexical Chaining), and 4-gram extraction (only Importance of Topics).
The latter is used for Romance languages if no stemming is available.
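
The TF-ISF measure behind the Importance of Topics Method can be illustrated with the following Python sketch. It assumes the usual formulation, by analogy with TF-IDF: the weight of a word w in a sentence s is TF(w, s) multiplied by log(N / SF(w)), where N is the number of sentences and SF(w) the number of sentences containing w, and a sentence is scored by the average weight of its distinct words. The averaging and the function name are assumptions of this sketch; tokens may equally be stems or 4-grams.

```python
import math
from collections import Counter


def tf_isf_scores(sentences):
    """Average TF-ISF of each sentence; `sentences` is a list of token lists."""
    n = len(sentences)
    # Sentence frequency: number of sentences in which each token occurs.
    sf = Counter(token for tokens in sentences for token in set(tokens))
    scores = []
    for tokens in sentences:
        if not tokens:
            scores.append(0.0)
            continue
        tf = Counter(tokens)
        total = sum(tf[w] * math.log(n / sf[w]) for w in tf)
        scores.append(total / len(tf))
    return scores
```

A word that occurs in every sentence receives weight zero, so sentences made only of such words are never favoured.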

The Lexical Chains Method is the most costly, since it depends on sophisticated
linguistic resources. Actually, this is the only method that has been significantly
modified for Brazilian Portuguese: for lack of an ontology and a WordNet for BP,
its implementation for BP in SuPor uses a thesaurus instead [4]. Other minor adaptations
of the referred methods to SuPor have been made, which are detailed in Section 3.

Our selection of these models has been limited, on the one hand, to the linguistic
resources already available at NILC¹ and, on the other hand, by their portability to
BP. Promising aspects of such methods include the following: a) in the Classifier
Method, metrics for AS can be made available through training and the automatic
summarizer may be modeled upon relevant corpus-based features; b) by focusing on
expressive relationships between lexical items, the Lexical Chains Method increases
the chance of selecting better sentences; c) similarly to the previous one, both the
Relationship Map and Importance of Topics methods target more coherence by
focusing upon the interconnectedness of the topics of a source text. However, the
latter innovates in using text tiling to determine the relevant ones.

Problematic aspects of those approaches are still common to most of the existing
AS proposals. They refer, for example, to a) the need to have a corpus of ideal
summaries² for training the Classifier; b) the need to provide specific, domain-dependent
information repositories, such as the list of signaling and proper nouns for
the Classifier or the lexicon and the ontology for the Lexical Chains Method, in addition
to taggers and parsers; c) the costly implementation of sophisticated methods, such as
the Lexical Chains one. However challenging those aspects may be, SuPor ultimately
aims at verifying that linguistic information helps produce more satisfactory output
extracts with respect to both information reproduction and coherence. We should
note, though, that the only property addressed in this paper is content selection, and
not coherence.

3 SuPor Architecture 

SuPor comprises training (Figure 1) and extraction (Figure 2)³. During training, each
feature is weighted by measuring its representativeness in both the source texts and
their corresponding ideal extracts. These have been built by correlating each sentence
of authentic, manually produced summaries with the sentences of the
corresponding source texts. The manual summaries have been built by a professional
summarizer. Correlations are based on the cosine similarity measure [30]. Relevant
features for AS are pointed out through training, yielding a probabilistic distribution.
This signals the probability of a given feature occurring in both texts, as in Kupiec
et al.’s approach [11].

¹ Núcleo Interinstitucional de Lingüística Computacional (http://www.nilc.icmc.usp.br).

² As defined by Teufel and Moens [31], ideal summaries are well-formed and correspond
satisfactorily to their respective source texts and, thus, can be considered guidelines.

³ Dotted frames signal NILC knowledge repositories for BP.

L.H. Machado Rino and M. Modolo

Fig. 1. SuPor training module

After training, the specialized user can customize SuPor to summarize any source 
text by (a) pinpointing the way the source text must be preprocessed (either removing 
stopwords and stemming what remains, or producing a 4-gram distribution); (b) 
selecting the group of features that will drive summarization; and (c) specifying the 
compression rate. These options are indicated in the user interface in Figure 2. The 
main features of the extraction module are detailed below. 

3.1 Features Selection 

Feature selection is at the core of SuPor, since it allows distinct AS
strategies to be used. Through the selected features, the strategies may address either
non-linguistic or linguistic knowledge. The former includes the features used in the
Classifier Method, i.e., sentence length, proper names, location of sentences in a
paragraph and location of a paragraph in the text, and keywords (i.e., word frequency).
As originally proposed in that method for texts in English, only initial and final
paragraphs of a source text and sentences of a paragraph (with a minimum of 2 sentences)
are selected by SuPor.

Linguistic knowledge is embedded in SuPor through the manipulation of surface
indicators of linguistically related information, such as those that link paragraphs,
build lexical chains, or determine topics. To overcome the connectedness
problems introduced by the reused methods, in some cases we have changed
them. For example, for the Relationship Map Method, all the paths are
calculated and all the resulting paragraphs are incorporated in just one extract.
By contrast, the Importance of Topics Method has been embedded in SuPor in full.
Similarly to the Lexical Chains Method, SuPor focuses on a single noun, or on the most
nuclear one, when a noun compound is involved. However, unlike in
English, the Brazilian Portuguese counterparts of noun compounds are
actually chains of adjectives modifying a single noun. So, determining their
nuclei is simpler than it is in English; it is enough to use NILC’s tagger [1]. Strong
lexical chains are determined by counting the number of words and the number of
their repetitions in the source text. The heuristics of the original method have also
been incorporated in SuPor, to build just one extract.




Fig. 2. SuPor extraction module 



The Classifier Method is not explicit in Figure 2. Instead, it is obtained by selecting
features 2 to 5. In all, SuPor embeds the seven features outlined in the box ‘Feature
selection’ in Figure 2.

3.2 Linguistic and Empirical Resources

SuPor’s BP-oriented linguistic resources include a stoplist, a lexicon [20], and a
thesaurus [4]. In addition to these, the probabilistic distribution resulting from training
is the most important empirical resource in SuPor. As mentioned before, SuPor’s generic
tools (i.e., the text segmenter, the stemmer, and the tagger) do not necessarily all run
together. Their activation depends on the preprocessing option and on the group of
features selected by the user. Text segmentation splits the source text
into sentences by applying simple rules based on punctuation marks. Paragraphs are
also marked. The BP stemmer [3] is an adaptation of Porter’s algorithm [25],
accounting for irregular verb forms and noun and verb suffixation in BP. Since
4-gramming has been indicated as a substitute for stemming for BP [12], the user may
choose between the two preprocessing options to produce the internal vector of the
source text. The tagger has been trained for BP under the generic tagset defined by
Ratnaparkhi [27].

By customizing SuPor, the user may delineate the AS strategy, the granularity of
text segmenting (either paragraphs or sentences), and the compression rate. The
relevant text units can then be classified to produce an extract. This is done in the
following way: firstly, every sentence that presents at least one of the features of the
selected group is considered; secondly, sentences are classified according to their
likelihood of being included in the extract. Likelihood is signaled by the probabilistic
distribution of features resulting from training. Considering preprocessing and
extraction options, SuPor offers 348 distinct summarization strategies.



4 Assessment of the SuPor Strategies 

SuPor strategies have been assessed on a small corpus, from which the strategies with
better performance are inferred. Then, one of its strategies was compared with other
extractive summarizers. Both experiments have been carried out in a black-box way,
i.e., by computing measures only on the produced extracts. A brief description of
them is given below.

4.1 Informativeness and Features Representativeness 

The assessment described here was limited to measuring the degree of 
informativeness of SuPor extracts. In doing that, we could also assess the 
representativeness of both the selected features and preprocessing options. So, we 
compared distinct features groupings to identify those leading to better results in 
summarizing BP texts. The test corpus included 51 newspaper articles (ca. 1 to 3 
pages long) of varied domains. They were chosen for their small size and readership 
coverage, in order to ease both the tasks of hand-building reference summaries and 
evaluating the output extracts. The test corpus was also used for training. For this 
reason, 5-fold cross-validation was used. 

Two similarity experiments were carried out: in the first, the output extracts
were compared with ideal extracts⁴; in the second, they were compared with ideal
summaries instead. Every condensed text was obtained at a ca. 28% compression rate
(and is, therefore, ca. 8 sentences long). Ideal extracts are those automatically
generated by an automatic generator of ideal extracts. Ideal summaries are those
hand-built by fluent BP writers and, thus, result from a rewriting task. Only
co-selection measures [26] were used, namely, precision (P), recall (R), and the balanced
F-measure (F). P is the ratio between the number of relevant sentences included in the
output extract and its total number of sentences. R is the ratio between the same
number of relevant sentences and the total number of sentences in the ideal reference
(either extracts or summaries). In the first experiment, those measures were computed
in a Boolean fashion (i.e., presence or absence of sentences of the output extract in the
corresponding ideal extract). Hereafter, this procedure is named sentence co-selection.
In the second, they were calculated through a comparison of the output extract and its
corresponding ideal summary. This procedure is named content-based similarity.
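
For the sentence co-selection procedure, the three measures follow directly from the definitions above. The short Python sketch below treats the output extract and the ideal extract as sets of sentence identifiers and takes the balanced F-measure as the harmonic mean of P and R; the function name is hypothetical.

```python
def co_selection(extract, ideal):
    """Return (precision, recall, F-measure) for two sets of sentence ids."""
    extract, ideal = set(extract), set(ideal)
    relevant = len(extract & ideal)  # sentences present in both
    precision = relevant / len(extract) if extract else 0.0
    recall = relevant / len(ideal) if ideal else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, an extract with sentences {1, 2, 3, 4} compared with an ideal extract {2, 4, 5, 6, 7} shares two sentences, so P = 2/4, R = 2/5 and F = 4/9.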

To overcome the impossibility of pairing sentences in the second experiment, each 
sentence of the output extract was compared with all the sentences of the ideal 
summary, as suggested by Mani [15], through a variation of the cosine similarity 
measure. 

Finally, to compute P and R for content-based similarity, the obtained values were
normalized to the interval [0,1] (most similar equals 1). After computing those for the
full collection of extracts, average measures were produced for the 348 features
groupings (FGs) provided by SuPor (amounting to 17,748 extracts, i.e., 348 groupings
applied to each of the 51 texts). The most significant figures are shown in Table 1,
along with their rankings in the whole collection.



Table 1. Average F-measures for both procedures

Features      Sentence co-selection      Content-based similarity
grouping      F-measure    Ranking       F-measure    Ranking
FG1           0.40         1st           0.42         2nd
FG2           0.40         1st           0.39         25th
FG3           0.38         12th          0.43         1st


The features groupings with the highest, and equal, F-measures in the sentence
co-selection procedure are FG1=[lexical chaining, sentence length, proper nouns] and
FG2=[lexical chaining, sentence length, words frequency], both running under the
4-grams preprocessing option. In the content-based similarity procedure, the best
grouping was FG3=[lexical chaining, relationship map], signaling that the
combination of the two full methods, Lexical Chaining and the Relationship Map one,
applied better to the test corpus. In this case, preprocessing options differed: Lexical
Chaining ran on text tiling and the Relationship Map Method, on 4-gramming.

By comparing FG1 with FG2, we can see that extracting sentences based upon
proper nouns or words frequency makes no difference in content selection. However,
performance based upon proper nouns was slightly better in the second procedure.
This may indicate that SuPor performs closer to ideal summaries when proper nouns
are focused upon, instead of words frequency. With respect to using any of the varied
features along with lexical chaining, the comparison between FG1 and FG3 shows that the
Relationship Map Method still outperforms them in the second experiment. After all,
it is worth noticing that both the Lexical Chains and Relationship Map methods use
the connectedness between text units to indicate the relevant ones. The fact that the
latter uses paragraphs instead of sentences may be significant in improving
performance. Overall, it is noticeable that, although 7 features were available in
SuPor, adding other features to any of the topmost groupings did not provide
a significant performance improvement. Nor did it deteriorate the results.
So, the best results indicate that those strategies that comprise only 3 out of 7
features should be better explored.

⁴ The generator of ideal extracts is also based upon the cosine similarity measure [30] and
can be found at http://www.nilc.icmc.usp.br/~thiago.

In both experiments lexical chaining was based on text tiling at the preprocessing
stage. It was also the commonest feature in the great majority of the topmost feature
groupings and the least common in those groupings with the smallest F-measures. The
least representative feature was ‘Importance of topics’: it appears in most of the worst
figures and is absent in the best ones. This lack of representativeness may be due to
the size of the source texts: because they are small, proper identification of their topics
could have been impaired. In other words, just one topic may be chosen, which does not
convey the important ones. The fluency of news texts may also imply that topic
change is too subtle for automatic detection.

Preprocessing by paragraphing or text tiling evenly influenced the figures in both
experiments. However, text tiling showed a slight improvement in SuPor
performance. Input stemming and 4-gramming had different effects, though: in sentence
co-selection, the preprocessing mode is not relevant; in content-based similarity,
4-gramming yielded better results for BP. Recall measures are smaller than precision
ones in both experiments, but in the second one they are slightly higher than in the first.
This may indicate that, in comparing output extracts with ideal summaries, instead of
comparing them with ideal extracts, more sentences are considered similar. However,
more experiments are needed to confirm this finding.

4.2 Comparing Just One Strategy with Other Extractive Summarizers 

Based on the former experiment, only one strategy of SuPor was chosen for a more
thorough comparison with six other systems, as reported in [28]. It combines the
following features: location (of sentences in a paragraph and of a paragraph in the
text), words frequency, sentence length, proper nouns, and lexical chaining. The
systems considered are the following (with a brief description of how they identify
sentences to include in an extract): (1) TF-ISF-Summ (Term Frequency-Inverse
Sentence Frequency-based Summarizer) [12], which mirrors Salton’s TF-IDF
information retrieval measure [30] for pulling out documents from a collection, but
correspondingly considers sentences from a single source text; (2) NeuralSumm
(Neural Summarizer) [23], which is driven by an unsupervised neural network for
sentence determination, based on a self-organizing map (SOM) [10]; (3) GistSumm
(Gist Summarizer) [22], which matches lexical items of the source text against
lexical items of its gist sentence; finally, (4) ClassSumm [13], which is also a
classification-based summarizer, like that of Kupiec et al., which is embedded in
SuPor. Added to those, the baseline from-top sentences and random-based systems
were used.

Location was the only feature included in the SuPor summarizer that was not
representative in the former evaluation. It was considered because it is a feature
common to 3 of the 6 other systems. The experiment was carried out in a blackbox
fashion on a single test corpus [24], distinct from the former one. This comprises 100
newspaper texts, paired with hand-produced summaries written by a consultant on the
Brazilian Portuguese language. For this reason, they were considered our corpus of
ideal summaries. Similarly to the former experiment, a 10-fold cross validation was
used, under a 30% compression rate (approximated to the ideal summaries and
extracts).

Run independently, the systems’ performances were also assessed similarly to the
former experiment; average precision, recall and F-measures were obtained
automatically. As shown in Table 2, the SuPor summarizer outperformed the other
systems, with an average F-measure ca. 38% above that of the random baseline and
ca. 0.43 when comparing absolute average performances.
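
The relation between these figures can be checked with a short sketch. It assumes that the average F-measure is the harmonic mean of the average precision and recall, and that "Avg. F over random" is the relative gain over the random baseline; the text does not state either formula, and the function names are ours. Under these assumptions the SuPor row of Table 2 is reproduced (42.8 and 38).

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def gain_over_baseline(f_system: float, f_baseline: float) -> float:
    """Relative improvement of a system's F-measure over a baseline, in %.

    Assumption: this is how the 'Avg. F over random' column is derived.
    """
    return 100.0 * (f_system - f_baseline) / f_baseline


# Average precision and recall of SuPor (in %) and the random baseline's
# average F-measure, as reported in Table 2.
supor_p, supor_r = 44.9, 40.8
random_f = 31.0

supor_f = round(f_measure(supor_p, supor_r), 1)
print(supor_f)                                       # 42.8
print(round(gain_over_baseline(supor_f, random_f)))  # 38
```

The same two functions applied to the other rows give values close to the reported ones, which supports the reading of the last column as a relative gain rather than an absolute difference.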

Although a distinct, and bigger, test corpus was used in this experiment and
location was considered, there was no improvement in SuPor performance when
compared to the former experiment. However, the figures show that the selected
features bring it close to the performance of ClassSumm, which is also a classification
system. This may not be surprising, because both systems are based on a Bayesian
classifier. However, ClassSumm uses a total of 16 features associated with each
sentence. It would thus be interesting to investigate whether SuPor performance is
representative enough of the lack of improvement of AS strategies when more
features are added.



Table 2. Systems performance (in %)

Systems         Avg. P   Avg. R   Avg. F   Avg. F over random
SuPor           44.9     40.8     42.8     38
ClassSumm       45.6     39.7     42.4     37
From-top        42.9     32.6     37.0     19
TF-ISF-Summ     39.6     34.3     36.8     19
GistSumm        49.9     25.6     33.8      9
NeuralSumm      36.0     29.5     32.4      5
Random order    34.0     28.5     31.0      0



5 Final Remarks 

The reported experiment showed that, of the five features, SuPor summarizer is 
distinctive on its lexical chaining (this is the only feature that differentiates it from the 
other systems). This confirms the first experiment. However, the features grouping, 
itself, should be better analyzed, for results may improve because of the combination 
of each component of such a grouping. With respect to combining features, SuPor 
provides a meaningful environment for the user to explore empirical measures to 
determine the relevance of text units, as both experiments show. Although the 
Bayesian classifier by Kupiec et al. also incorporates a combination of features, SuPor 
goes one step further in allowing any number of features to be chosen out of the seven 
largely considered in the field nowadays. However, SuPor usability has not yet been 
assessed. One of the reasons is that it offers too many summarizing possibilities. A 
more productive strategy is to limit it to just the most promising features groupings, in 
order to carry out more comprehensive tests. Besides, to make better use of SuPor,
more people should be available to work with it. This should pose no difficulty
from the user viewpoint, for it runs on a user-friendly, Windows-based platform. The
problem rests in the expertise level of the user: the more specialized the user is in AS
features, the more directly or deeply s/he can assess the influence of linguistic
features on the production of the extracts. This is one of the most interesting aspects
of SuPor: the user can select, for example, only non-linguistic features and run it,
then run it only on the linguistic ones, in order to compare the results obtained
through both groups of features. Actually, our preliminary assessments showed that
considering linguistic features (such as lexical chaining and relationships mapping) 
outperforms the results produced when considering only non-linguistic ones. 

Even though the comparative analysis has been made on a very small corpus, the 
results are promising. SuPor exploration may well end up as a means to specify a 
useful benchmark comparison to the field. So far, we have no knowledge of the 
existence of similar environments. CAST [21] seems to be the closest to SuPor one can
get: it also considers a set of features, including lexical cohesion [9], but it aims at 
giving support to human summarizers instead of providing the means to deeply 
explore distinct summarizing strategies and their potential combinations. Clearly, 
CAST writers could not assess AS strategies from the viewpoint of research and 
development of AS systems, as it is intended in SuPor. So, it would be interesting to 
put together CAST and SuPor environments, to complement each other: CAST allows 
registering feedback from the writer, annotating important sentences signaled by its 
distinct summarization methods, and comparing results. By running SuPor on 
common corpora to CAST, its results could be thus compared to information obtained 
through CAST registers. Also, similar experiments to the ones reported above could 
be carried out involving both CAST and SuPor. In order to do so, CAST should be 
also assessed as a summarizing tool. To our knowledge, this has not been done so far, 
which makes SuPor assessment yet more useful. However, in considering CAST and
SuPor together, common NL-dependent resources must be provided.

Abstract. In segmenting a given language or languages in general in a
systemic manner, we show how it is possible to effect computations which 
lead to reliable human language technology applications. Some problems 
have not yet been solved in language processing; we will show how in 
applying the theory “systemic analysis” and the calculus of “SyGuLAC” 
(Systemic Grammar using a Linguistically motivated Algebra and Calcu- 
lus), we can solve difficult problems. Systems such as the ones outlined in 
the paper can be created whenever there is a need. The only requirements 
are that such systems can be manipulated, and that they be verifiable 
and above all traceable. A system ought to be computable and be able 
to be represented in its entirety; if not it cannot be verified. 

Keywords: human language technology, language calculability, system, 
systemic analysis, agreement of French participle. 



1 Introduction 

The intention of this paper is to show how by segmenting a language in a systemic 
manner it is possible to obtain human language technology applications which 
give reliable results. We will show firstly with an example why systemic analysis 
is useful and secondly the results of an application of systemic analysis knowing 
that complementary processes have to be done at the level of the sentence. 

2 Systemic Analysis 

Whether one examines a language from the point of view of its components 
(signifiers) in the Saussurian sense or of its organisation (syntax), languages 
fortunately present regularities which ought to be systematically brought into 
prominence. Grammarians have always studied the problem of language descrip- 
tion, and regularities and rules have been discerned as much for syntax as for 
lexis at least for the teaching or the comparison of languages. With the advent of 
the digital computer, we are now confronted with problems concerning language 
analysis, recognition and generation by machine. 

Systemic analysis [1], [4] is based on the postulate that a language or indeed 
languages (as in translation) can be segmented into individual systems based on 
the observation that such systems influence each other. 
The systems that we are talking about have to be sufficiently small to be
grasped by a human and subsequently processed by machine,
but also sufficiently large so as to function formally as a whole. Each such system 
may be a component of a larger system to which the system’s properties can be 
extended. 

Furthermore it must also be possible to create links between the component 
systems of a system in order to obtain a complete formal description of the 
system. The component systems of such a system may be nested, have common 
parts or function in identical and identifiable manners. All levels of language 
analysis can be described in this way (lexis, syntax, morpho-syntax, semantics, 
morpho-semantics and others). Because of this, what must then be done is to 
describe the invariant part of a system as well as its variant parts and to state 
formally the reason for the variations. 

The advantage of this systemic analysis model is that it is reliable and finely
grained. Systemic analysis is conducive to mathematical modelling by means of
mathematical structures which place language first and incorporate constraints
such as to maximise the coverage. Such modelling involves, for example, the
modelling of grammatical classifications (by means of nested partitioning and thus
the use of equivalence relations) and modelling for computational ends (entailing
model-theoretical and proof-theoretical approaches), including the benefits
provided by normalising (supported by means of lattice theory), such as algebraic
and calculational operations on analyses [6], [7]. A systemic analysis can be
viewed as a concordance for the specific linguistic phenomenon treated, in which
each case (equivalence class) is treated. With systemic linguistics, one can,
amongst other things, verify that a system of relations is well formed in terms of
strict nesting constraints [5], [8]. Furthermore, systemic analysis allows tracing of
operations done by machine, and thus omitted cases can easily be added. The
methodology is suited as much to lexis as to syntax, to semantics or
to their intersections (morpho-syntax etc.) [2].
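
The nesting constraint mentioned above can be illustrated with a small sketch. It assumes that each level of a grammatical classification is given as a partition of one and the same set of categories, and that "strictly nested" means that every class of a finer level lies inside exactly one class of the level above. The grouping used (nominal versus verbal, then a finer split) and the function names are illustrative assumptions, not taken from [5] or [8].

```python
from typing import FrozenSet, Iterable, List, Set

Block = FrozenSet[str]


def is_partition(blocks: Iterable[Block], universe: Set[str]) -> bool:
    """True if the blocks are non-empty, pairwise disjoint and cover the universe."""
    seen: Set[str] = set()
    for block in blocks:
        if not block or seen & block:
            return False
        seen |= block
    return seen == universe


def refines(fine: List[Block], coarse: List[Block]) -> bool:
    """True if every block of `fine` is contained in some block of `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)


def strictly_nested(levels: List[List[Block]], universe: Set[str]) -> bool:
    """Check a classification given from the coarsest level to the finest."""
    if not all(is_partition(level, universe) for level in levels):
        return False
    return all(refines(levels[i + 1], levels[i]) for i in range(len(levels) - 1))


universe = {"Nom", "Adj.", "Verbe conj.", "Verbe inf.", "Ppa."}
coarse = [frozenset({"Nom", "Adj."}),
          frozenset({"Verbe conj.", "Verbe inf.", "Ppa."})]
fine = [frozenset({"Nom"}), frozenset({"Adj."}),
        frozenset({"Verbe conj."}), frozenset({"Verbe inf.", "Ppa."})]
# A partition that cuts across the coarse classes, hence not nested in them.
crossing = [frozenset({"Nom", "Ppa."}),
            frozenset({"Adj.", "Verbe conj.", "Verbe inf."})]

print(strictly_nested([coarse, fine], universe))      # True
print(strictly_nested([coarse, crossing], universe))  # False
```

Because each partition corresponds to an equivalence relation, the refinement test is the same as checking that the finer relation is included in the coarser one.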

A properly conducted analysis incorporates for each case a justification, from 
which benchmarks can be constructed. Furthermore, systemic analysis is neu- 
tral in marrying linguistic performance and linguistic competence where both 
empiricism and thesis are admitted. The evaluation of a given systemic analysis 
is in reality a clear matter. One has to ensure that the analysis is sufficiently 
fine (are there still further cases?) and that each case revealed in the analysis is 
accompanied with a justification (with the nature of the justification itself justi- 
fied). Justifications and their nature can depend on the application. For example 
there are applications such as controlled languages used in safety critical systems 
that demand a corpus based and indeed experimental based approach (and thus 
are performance based) together with provenance information, these providing 
a static reproducible trace for each case. In this matter systemic linguistics with 
its inherent traceability marries well with systemic quality. By default the jus- 
tification is synthetic, devised by the linguist and depending thus on his or her 
competence. We see too that in terms of scientific method systemic linguistics 
methodology accommodates to theoretical and intuitional approaches as well as 
to empirical and experimental approaches. Indeed, at the Centre Tesnière we use
systemic analysis itself as a meta-analysis tool for producing linguistic systemic
analyses.

In this paper we give an example of the application of the “systemic theory”
of French lexico-morpho-syntax for the agreement of the French past participle. 
We thus will also see that syntax cannot be in reality separated from lexis and 
from morphology. We show how such a system can be represented so that it can 
be processed by machine. 

Because in general the different systems influence each other, what has to 
be done is to describe for each system its invariant part, which allows us to 
find and indeed discover the system, to name it and to give it prominence as a 
system in its own right. The variant parts of a system need also to be described, 
these parts being due to the system being put into relation with other systems 
of the language system, this in the knowledge that a system can be related with 
several systems which in their turn can influence one another. Thus to begin 
with we have a recognizable “invariant” system which is necessarily canonical 
and another system representing its variant parts. However, we are missing the 
system which relates these two systems, this being a system of relationships. It 
is these systems of relationships which allow us to organize system 1 (canonical) 
according to the systems 2 (variations). 

3 Methodology 

To illustrate the systemic modeling approach, we take as an example one which 
is indeed quite complex, and this is the agreement of the French past participle 
followed by an infinitive [1]. We have here three systems: 

— System 1: the past participles (represented below by a.,b.,...,h.) 

— System 2: their inflexions (represented below by I, A) 

— System 3: the relationship that structures the past participle system 1 ac- 
cording to the inflexional system 2 (represented by what is between a. and 
I, for example) 

What is in fact going to be the most difficult thing to do is not so much 
describing the first two systems, which involves finding for system 1 all the 
French past participles as well as for system 2 all their inflected forms, but it 
is rather how to describe system 3 which links systems 1 and 2. In creating 
this relationship, it is necessary to say when a link can exist and when not. 
Furthermore, the representation used for system 3 will depend on whether we 
envisage use by humans or automatic processing by machine, but in both cases 
an algorithm [9] is necessary (this being the Calculus in SyGuLAC). In the 
former case the algorithm could take the following form: 

Algorithm

— a. the past participle is preceded by “en” → I

— b. the past participle of the verb faire is immediately followed by an infinitive → I

— c. the past participle of the verb laisser is immediately followed by an infinitive
     → I or A

— d. eu, donné, laissé are followed by “à” and the direct object refers to the
     infinitive → I

— e. the verb (past participle) marks an opinion or a declaration (List) → I

— f. the direct object performs the action of the infinitive → A

— g. the direct object refers to the past participle → A

— h. the direct object refers to the infinitive → I

List: affirmé, assuré, cru, dit, espéré, estimé, nié, pensé, prétendu, promis,
reconnu, voulu

Operators: I: invariable; A: agreement

Agreement of the French Past Participle: Interactive Algorithm

whilst in the latter case, that of automatic processing, the representation of the 
algorithm could take the following form: 

— a’. P1(,)(adv)en(P3)(adv)(P4)(adv)avoir(adj)(P3)(P4)(adv)p.p.(adv)
      (prep)(P5)inf(P5) → I

— b’. P1(,)(adv)(P2)(adv)(P3)(adv)en(adv)avoir(adj)(P3)(adv)p.p.(adv)
      (prep)(P5)inf(P5) → I

— c’. P1(,)(adv)P2(adv)(P)(P3)(adv)(P4)(adv)avoir(adj)(P3)(P4)(adv)
      fait(adv)(P5)inf(P5) → I

— d’. P1(,)(adv)P2(adv)(P)(P3)(adv)(P4)(adv)avoir(adj)(P3)(P4)(adv)
      List(adv)(prep)(P5)inf(P5) → I

— k’. P1(,)(adv)P2(adv)(P)(P3)(adv)(P4)(adv)avoir(adj)(P3)(P4)(adv)
      p.p.(adv)(prep)P5(adv)inf → I

    • u’. P5 is the direct object → agreement with P2

        * g’. P2 = “que” → agreement with P1

        * h’. if P3 exists → agreement with P1

— j’. P1(,)(adv)P2(adv)(P)(P3)(adv)(P4)(adv)avoir(adj)(P3)(P4)(adv)
      p.p.(adv)(prep)inf(adv)P5 → I

    • u’. P5 is the direct object → agreement with P2

        * g’. P2 = “que” → agreement with P1

        * h’. if P3 exists → agreement with P1

Agreement of the French Past Participle: Automatic Algorithm

The execution of such algorithms is performed in the following manner. Taking
for example the automatic algorithm:

If a’ is true, then the past participle is invariable (operator I).

If a’, b’, c’ and d’ are false and k’ is true, then u’ is tested. If u’ is true, g’ is
tested: if g’ is true, the agreement is made with P1. Otherwise, g’ being false, h’ is
tested: if h’ is also false, the operator of the last true condition, u’, is applied,
that is to say, the agreement is made
with P2.
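
This execution scheme can be sketched as a walk through a tree of conditions: the first true condition at a level is selected, its sub-conditions are then tried, and the operator of the last true condition on the path is returned. The sketch below is only an illustration under that reading. The class and function names are ours, and the truth value of each condition (in practice the result of matching the pattern against the tagged sentence) is supplied by hand.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Rule:
    """One condition of the algorithm, its operator and its nested sub-conditions."""
    label: str
    operator: str
    children: List["Rule"] = field(default_factory=list)


def evaluate(rules: List[Rule], truth: Dict[str, bool]) -> Optional[str]:
    """Return the operator of the last true condition on the first matching path."""
    for rule in rules:
        if truth.get(rule.label, False):
            refined = evaluate(rule.children, truth)
            return refined if refined is not None else rule.operator
    return None  # no condition of this level applies


def direct_object_rules() -> List[Rule]:
    """Sub-conditions shared by k' and j': u', itself refined by g' and h'."""
    return [Rule("u'", "agreement with P2",
                 [Rule("g'", "agreement with P1"),
                  Rule("h'", "agreement with P1")])]


AUTOMATIC = [
    Rule("a'", "I"), Rule("b'", "I"), Rule("c'", "I"), Rule("d'", "I"),
    Rule("k'", "I", direct_object_rules()),
    Rule("j'", "I", direct_object_rules()),
]

print(evaluate(AUTOMATIC, {"a'": True}))                          # I
print(evaluate(AUTOMATIC, {"k'": True, "u'": True}))              # agreement with P2
print(evaluate(AUTOMATIC, {"k'": True, "u'": True, "g'": True}))  # agreement with P1
```

Keeping the conditions as data rather than as nested if-statements makes each decision traceable: the path of labels that led to an operator can be logged alongside the result.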

With the algorithm (or super-system) which incorporates these three systems, 
we can, for example, obtain the correct response (agreement or no agreement) 
for a sentence such as: 
Ces bûcherons ,  je  les  ai     fait?  sortir
P1            ,  P2  P3   avoir  p.p.   inf
(These lumberjacks, I have made them leave)

If we examine the automatic algorithm, we find that the structure shown at 
condition c’ is the appropriate one: 

— c’. P1(,)(adv)P2(adv)(P)(P3)(adv)(P4)(adv)avoir(adj)(P3)(P4)(adv)
      fait(adv)(P5)inf(P5) → I

which provides the correct response, operator I, this being that there is no agree- 
ment of the past participle. 

We can see that all the parts or layers (lexis, syntax, morphology, and se- 
mantics) of linguistic analysis are involved together in an organized system. 



4 Application 

Now, let us take a text representing all the cases of agreement of the past
participle of pronominal verbs in French. We see here the concern for exhaustive
analyses in systemic linguistics. The text can be the basis of a benchmark and,
for the automatic algorithm, an automatic test data set (thus enabling regression
testing). It is for this reason that we reproduce the text in full.



4.1 Text 

(due to Yves Gentilhomme) 

4.2 Formal Representation

Following our representation of the solution of this second problem, the agree- 
ment of the past participle of pronominal verbs in French by a second algorithm, 
we show how to perform automatically, with an example, the agreement of a 
past participle in the text given above. This second algorithm could take the 
following form : 

Algorithm

— a. the subject is the impersonal pronoun “il” → I

— b. the verb is se plaire, se déplaire, se complaire, se rire → I

— c. the past participle of the verb is followed by an infinitive → I

    • d. the subject of the past participle performs the action of the infinitive
         → A with the direct object

— e. “se” is the indirect object or the second object → I

    • f. the direct object precedes the past participle → A with the direct
         object

For the other cases, the general rule is: A with the subject

Operators: I: invariable; A: agreement

Agreement of the French Past Participle: Interactive Algorithm
We do not present the methodology or the calculus; these would be organized
in the same manner as in the first example given above, that of the agreement of
the French past participle followed by an infinitive. The only thing to be noticed
is that when the object is “se” and it is “se” that governs the agreement, we in fact
have to refer to the subject to make the agreement (see P1 below).



4.3 Analysis 

Let us take the first sentence of our text and check its past participles
automatically:

(1) Jeanneton s’est blessé, elle s’est entaillé profondément deux phalangettes
avec une faucille



The first step of our grammar checker is the tagging of each unit of the
sentence; this is performed by a morpho-syntactic analysis.

We use Labelgram [2], [5], a system for the automatic morphologically
disambiguated tagging of texts, which has been developed in the L. Tesnière
Research Centre (see http://www.labelgram.com). This system is also based on
systemic linguistics and as such serves as an evaluation of the methodology. In
Labelgram, the overall systemic grammars are ‘super’ systems modelling the
relationship between texts and morphologically disambiguated tagged texts.

One of the strengths of this system is its intensionally driven approach. This
approach is inherent to systemic linguistics, in which the methodology captures
the grammarian’s concern with language description, the regularities present in
languages and their exceptions. This intensionally driven approach leads not
only to representations which are efficient in terms of size [3] but which also
accept, for example, in the case of word dictionaries, neologisms “obeying the rules
for French word formation”. For example, in Labelgram’s French raw grammatical
category tagger, the context rule which recognises and tags word forms ending
in at least -er has 579 entries, whilst the electronic Robert dictionary of French,
an extensional dictionary, has 6666 entries. The overall default context tags
(perhaps new) words in -er as Verbe inf. (verb infinitives).
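
The difference between an intensional rule and an extensional word list can be sketched as follows. This is a minimal illustration, assuming that the rule consists of a default tag for the ending plus a list of exceptions; the three exception entries and the function name are hypothetical and are not Labelgram's actual 579 entries.

```python
from typing import Dict, Optional

# Hypothetical exception entries: French words in -er that are not infinitives.
ER_EXCEPTIONS: Dict[str, str] = {
    "hiver": "Nom",
    "mer": "Nom",
    "cher": "Adj.",
}


def tag_er_form(word: str) -> Optional[str]:
    """Tag a word form ending in -er; return None if the rule does not apply."""
    form = word.lower()
    if not form.endswith("er"):
        return None
    # Default context: any other form in -er, known or new, is an infinitive.
    return ER_EXCEPTIONS.get(form, "Verbe inf.")


print(tag_er_form("chanter"))  # Verbe inf.
print(tag_er_form("googler"))  # Verbe inf. (a neologism is accepted by default)
print(tag_er_form("hiver"))    # Nom
```

Only the exceptions need to be stored, which is why such a rule can stay much smaller than a dictionary that enumerates every form.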

Moreover, Labelgram disambiguates each unit of the sentence; a first super- 
system finds all grammatical categories of each unit and a second super-system 
calculates the right category in the given context; the complete system being the 
forward relational composition of these two systems (an example of the Algebra 
in SyGuLAC). 

For example, the unit blessé in our sentence can be:

— a Substantive: blessé (The hurt man is led to the hospital.)

— a Past participle: blessé (The hunter hurt the animal.)

— an Adjective: blessé (The hurt and tired man cannot run.)

Using context rules (our ambiguous unit follows a verb and precedes a comma),
the second super-system of Labelgram detects that in this case the unit blessé
is a Past Participle.
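
The two super-systems and their composition can be sketched as two functions applied one after the other. The sketch is a strong simplification: the candidate categories are those listed in the Labelgram output for the first clause of the sentence, the rule for the past participle is the context rule quoted above, and the rule that resolves "est" after a personal pronoun, the fallback, and all function names are our own assumptions.

```python
from typing import Callable, Dict, List, Tuple

Candidates = List[Tuple[str, List[str]]]
Tagged = List[Tuple[str, str]]

# Candidate categories as listed in the Labelgram output for the example.
LEXICON: Dict[str, List[str]] = {
    "jeanneton": ["Nom"],
    "s": ["Pro. pers."],
    "est": ["Nom", "Verbe conj."],
    "blessé": ["Nom", "Ppa.", "Adj."],
    ",": ["virgule"],
}


def all_categories(tokens: List[str]) -> Candidates:
    """First super-system: list every possible category of each unit."""
    return [(tok, LEXICON[tok]) for tok in tokens]


def disambiguate(units: Candidates) -> Tagged:
    """Second super-system: choose one category per unit from its context."""
    tagged: Tagged = []
    for i, (tok, cats) in enumerate(units):
        prev = tagged[-1][1] if tagged else None
        nxt = units[i + 1][1] if i + 1 < len(units) else []
        if len(cats) == 1:
            choice = cats[0]
        elif "Verbe conj." in cats and prev == "Pro. pers.":
            choice = "Verbe conj."  # assumed rule, not taken from the paper
        elif "Ppa." in cats and prev == "Verbe conj." and nxt == ["virgule"]:
            choice = "Ppa."  # the unit follows a verb and precedes a comma
        else:
            choice = cats[0]  # arbitrary fallback for this sketch
        tagged.append((tok, choice))
    return tagged


def compose(first: Callable, second: Callable) -> Callable:
    """Forward composition: apply `first`, then `second` to its result."""
    return lambda x: second(first(x))


tag = compose(all_categories, disambiguate)
print(tag(["jeanneton", "s", "est", "blessé", ","]))
```

Here the composition is ordinary function composition; a relational composition in the sense of SyGuLAC would also allow several outputs per input, which this sketch does not model.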

The system correctly tags each unit of our sentence and gives the following 
result: 



Table 1. Result of the Labelgram analysis - http://www.labelgram.com

Mot            Catégories                                         Catégorie
jeanneton      [Nom]                                              Nom
s              [Pro. pers.]                                       Pro. pers.
est            [Nom, Verbe conj.]                                 Verbe conj.
blessé         [Nom, Ppa., Adj.]                                  Ppa.
,              [virgule]                                          virgule
elle           [Pro. pers.]                                       Pro. pers.
s              [Pro. pers.]                                       Pro. pers.
est            [Nom, Verbe conj.]                                 Verbe conj.
entaillé       [Ppa., Adj.]                                       Ppa.
profondément   [Adv.]                                             Adv.
deux           [Adj. num. car.]                                   Adj. num. car.
phalangettes   [Nom]                                              Nom
avec           [Adv., Prp.]                                       Prp.
une            [Art. ind., Adj. num. car., Adj. num. ord., Nom]   Art. ind.
faucille       [Nom]                                              Nom


A second major strength of Labelgram which it owes to systemic analysis is 
that it can treat sequences of units with grammatical category ambiguities, and 
this is indeed the case in our sentence as is shown above. 

This tagging is essential for the second step of our checking system. Indeed,
in order to simplify the syntactic analysis for the automatic checking of the
agreement of the past participle, our methodology consists in the reduction of
the sentence to its “minimum”, analysing the parts of speech of each unit of the
sentence [10]. The aim is to keep only the elements necessary for our application;
this means the sentence’s subject and verb, and sometimes its objects.

For our example sentence, the system detects two past participles (Ppa.) to
check: blessé and entaillé. It should thus find two structures.

The “superficial” units of the sentence correspond to the elements in
parentheses in the algorithm of our system 3: the adjectives, the adverbs, but also
the incidental complements, the noun complements, etc. In reality, none of
these elements has any influence on the agreement of the past participle. Their
role is to specify variously the place, the manner and the time of the action.
These are semantic elements which are not relevant for this application. This is
the reason why they can be separated out and removed by our grammar
correcting system.
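
The reduction of a tagged sentence to its "minimum" can be sketched as a filter over the tagged units. The sketch assumes that adverbs and adjectives are always removable and that every complement introduced by a preposition is incidental and ends at its noun; both are simplifications of ours, as are the names used, and the real simplification system [10] is certainly richer.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]

# Tags of units assumed to have no influence on the agreement.
REMOVABLE = {"Adv.", "Adj."}


def simplify(tagged: Tagged) -> Tagged:
    """Drop adverbs, adjectives and prepositional complements."""
    kept: Tagged = []
    skipping_complement = False
    for word, tag in tagged:
        if tag == "Prp.":
            skipping_complement = True  # a prepositional complement starts
            continue
        if skipping_complement:
            if tag == "Nom":
                skipping_complement = False  # it ends with its noun
            continue
        if tag in REMOVABLE:
            continue
        kept.append((word, tag))
    return kept


# Second clause of the example sentence, tagged as in the Labelgram output.
clause = [
    ("elle", "Pro. pers."), ("s", "Pro. pers."), ("est", "Verbe conj."),
    ("entaillé", "Ppa."), ("profondément", "Adv."),
    ("deux", "Adj. num. car."), ("phalangettes", "Nom"),
    ("avec", "Prp."), ("une", "Art. ind."), ("faucille", "Nom"),
]

print(" ".join(word for word, _ in simplify(clause)))
```

The units that remain (subject pronoun, reflexive pronoun, auxiliary, past participle and object) are the ones that the agreement rules inspect.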
Thus, for our start sentence:

(1) Jeanneton s’est blessé, elle s’est entaillé profondément deux phalangettes
avec une faucille

we obtain, after the simplification system, the following two “sub-sentences”:

(1a) Jeanneton s’est blessé

(1b) elle s’est entaillé deux phalangettes



These two sub-sentences correspond to the following two syntactic structures 
respectively: 



(1a) P1 (Pro. pers.) (Verbe conj.) (Ppa.)

(1b) P1 (Pro. pers.) (Verbe conj.) (Ppa.) P2

Our system 3 then applies the agreement rule linked to each of the obtained 
structures in order to check the past participle and correct it if need be. 

For the first structure:

(1a) Jeanneton s’est blessé

(1a) P1 (Pro. pers.) (Verbe conj.) (Ppa.)

system 3 contains the rule:

— P1 Pro. pers. être Ppa. → Agreement with P1

Thus, the system checks that the past participle blessé is in correct agreement
in gender and number with the subject P1, Jeanneton:

Jeanneton
• gender: feminine
• number: singular

blessé
• gender: masculine
• number: singular

The system detects an error and corrects it automatically:

(1aCorrected) Jeanneton s’est blessée
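
The check and the correction for this rule can be sketched as follows. The sketch assumes that the base (masculine singular) form of the participle and the gender and number of P1 are already known from the tagging step, and it covers only regular endings; the function names are ours.

```python
from typing import Tuple

# Regular French past-participle endings by (gender, number).
# Participles whose base form already ends in -s (e.g. "pris") are not handled.
ENDINGS = {
    ("masculine", "singular"): "",
    ("feminine", "singular"): "e",
    ("masculine", "plural"): "s",
    ("feminine", "plural"): "es",
}


def inflect(base: str, gender: str, number: str) -> str:
    """Build the participle form that agrees with the given gender and number."""
    return base + ENDINGS[(gender, number)]


def check_agreement(found: str, base: str, gender: str, number: str) -> Tuple[bool, str]:
    """Return (is_correct, expected_form) for agreement with P1."""
    expected = inflect(base, gender, number)
    return found == expected, expected


# P1 is feminine singular; the participle found in the text is "blessé".
print(check_agreement("blessé", "blessé", "feminine", "singular"))
# (False, 'blessée')
```

When the operator is I (invariable), the same function can be called with masculine singular, since the invariable form coincides with the base form.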

As we have argued elsewhere [2], in some cases we need a semantic or prag- 
matic analysis to complete the syntactic analysis. 

For the second structure:

(1b) elle s’est entaillé deux phalangettes

(1b) P1 (Pro. pers.) (Verbe conj.) (Ppa.) P2

system 3 contains the rule:

— P1 Pro. pers. être Ppa. P2 (and P2 is the direct object) → I
In order to know if P2 is the direct object, we need to identify some semantic
information contained in the syntagm P2, which is necessary for checking
the agreement of the past participle. Indeed, P2 could be “la nuit”, for example,
which would not be a direct object. However, in this case, our simplification
system would have detected that “la nuit” corresponds to an incidental complement
of time and would have deleted it from the structure to keep only the minimal
structure (1a): P1 (Pro. pers.) (Verbe conj.) (Ppa.)

Thus, in our example, P2 “deux phalangettes” is the direct object and
consequently rule ‘e.’ of our algorithm is applied. The system checks that the past
participle entaillé is in the invariable form (operator I), which means masculine
singular:

entaillé
• gender: masculine
• number: singular

Thus the past participle does not need to be corrected.
Let us take another sentence from our text: 



(2)

After applying the simplification system, we obtain the seven sub-sentences to
be checked, and these are:

(2a) (2b) (2c) (2d) (2e) (2f) (2g)

For all these similar sub-sentences we have the same structure P1 (Pro. pers.)
(Verbe conj.) (Ppa.), which corresponds to the general rule, where in fact “se” is
the direct object of the verb; in consequence, the system checks the agreement
of the past participle with the subject. Thus, our second
example will be corrected to give:

(2Corrected)

With these results provided by our application, we see that systemic analysis 
is an efficacious methodology for solving difficult problems such as the automatic 
agreement of the French past participle, a problem that no other current checking 
system succeeds in correcting. 

5 Conclusion

Systemic analysis is useful because the decompositions to be done are not
definitively imposed by systemic analysis itself, whether from the beginning or
subsequently. On the contrary, the different systems are decomposed (and thus built)
according to the problems to be solved during the analysis. What systemic
analysis does provide is the structuring in terms of Super-system, System 1, System
2, System 3. In our example, we have shown that the usual decomposition in
terms of syntax, lexis, morphology is not really what is needed to solve
problems in language processing; on the contrary, we have designed systems which
include data from all these “layers”, and only as and when needed.

Abstract. This paper presents the application of the 2-step RSV
(retrieval status value) merging algorithm to DIR environments. The 
reported experiment shows that the proposed algorithm is scalable and 
stable. In addition, it obtains a precision measure higher than the well 
known CORI algorithm. 



1 Introduction 

Usually, a Distributed Information Retrieval (DIR) system must rank document
collections by query relevance, select the best set of collections from a ranked
list, and merge the document rankings that are returned from a set of
collections. This last issue is the so-called “collection fusion problem” [8, 9],
and it is the topic of this work. We propose an algorithm called 2-Step RSV
[3, 4]. This algorithm works well in Cross Language Information Retrieval (CLIR)
systems based on query translation, but the application of 2-Step RSV in DIR
environments requires an additional effort: learning of collection issues such as
document frequency, collection size and so on. On the other hand, since 2-Step
RSV makes up a new global index based on query terms and the whole
of retrieved documents, it makes possible the application of blind feedback
at a global level by means of the DIR monitor, rather than at a local level by
means of each individual Information Retrieval (IR) engine. Previous works
have investigated the application of Pseudo-Relevance Feedback (PRF) to
improve the selection process of the best set of collections from a ranked list [5].
This work emphasizes the effectiveness of PRF applied to the collection fusion
problem. Finally, a second objective is to study the stability of 2-Step RSV
against weighting function variances in the local indices.

The rest of the paper is organized as follows. First, we present a brief 
review of DIR problems. Section 2 describes our proposed method, which is 
integrated into the CORI model [1]. In section 3, we detail the experiments 
carried out and the results obtained. Finally, we present our conclusions and 
future lines of work. 

2 The 2-Step RSV Algorithm 

The basic 2-step RSV idea is straightforward: given a query term distributed over 
several selected collections, their document frequencies are grouped together. In 
this way, the method recalculates the document score by changing the document 
frequency of each query term. Given a query term, the new document frequency 
will be calculated by means of the sum of the local document frequency of the 
term for each selected collection. Given a user query, the two steps are: 

1. The document pre-selection phase consists of searching for relevant documents 
locally in each selected collection. The result of this task is a single 
collection of preselected documents (the $I'$ collection), obtained as the union 
of the top retrieved documents of each collection. 

2. The re-indexing phase consists of re-indexing the global collection $I'$, but 
considering solely the query vocabulary. Only the query terms are re-indexed: 
given a term, its document frequency is obtained by adding together the 
document frequencies of that term in each selected collection. Finally, a 
new index is created by using the global document frequency, and the query 
is run against the new index. For example, if two collections, $I_1$ and $I_2$, 
are selected and the term “government” is part of the query, then 
the new global document frequency is 
$df_{I_1}(\text{government}) + df_{I_2}(\text{government})$. 
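
The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `two_step_rsv`, its argument layout, the BM25 variant (with the common `log(1 + ...)` idf and default `k1`, `b` values) and the estimate of the average document length from the pooled documents are all assumptions made for the example.

```python
import math
from collections import Counter

def two_step_rsv(query_terms, local_results, local_df, local_sizes, k1=1.2, b=0.75):
    """Hypothetical sketch of 2-step RSV merging.

    local_results: {collection: [(doc_id, [token, ...]), ...]}  top documents per collection
    local_df:      {collection: {term: local document frequency}}
    local_sizes:   {collection: number of documents in the collection}
    """
    # Step 1 (pre-selection): pool the top documents of every selected collection.
    pool = [(c, d, toks) for c, docs in local_results.items() for d, toks in docs]

    # Step 2 (re-indexing): global df of each query term = sum of the local dfs.
    n_docs = sum(local_sizes[c] for c in local_results)
    global_df = {t: sum(local_df[c].get(t, 0) for c in local_results)
                 for t in query_terms}
    # Assumption: average length estimated from the pooled documents only.
    avg_len = sum(len(toks) for _, _, toks in pool) / len(pool)

    scored = []
    for c, d, toks in pool:
        tf = Counter(toks)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0 or global_df[t] == 0:
                continue
            idf = math.log(1 + (n_docs - global_df[t] + 0.5) / (global_df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scored.append((score, c, d))
    # A single merged ranking, best score first.
    return sorted(scored, reverse=True)

ranking = two_step_rsv(
    ["government"],
    {"I1": [("a", ["government", "budget"]), ("b", ["sports", "news"])],
     "I2": [("c", ["government", "government", "policy"])]},
    {"I1": {"government": 10}, "I2": {"government": 30}},
    {"I1": 100, "I2": 200},
)
```

Because every pooled document is scored with the same global document frequency (here 10 + 30 = 40 for “government”), the scores are directly comparable across collections, which is the point of the second step.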

In this work, we have used OKAPI BM-25 [6] to create the global index for 
the second step of the 2-step RSV approach. Computing OKAPI BM-25 requires 
the collection size, the average document length, the term frequency and the 
document frequency. These elements, except the term frequency, are learned by 
means of the algorithms of [2] and [7]. Term frequency requires the document to 
be downloaded before 2-step RSV creates the index at query time. Note that the 
merging is carried out incrementally. For example, the DIR monitor downloads 
two or three documents per selected collection, applies 2-step RSV and shows 
the result to the user. If the user requires more documents, the DIR monitor 
downloads the next two or three documents per selected collection, and so on. 

After some documents (no more than 10 or 20) have been downloaded and 
re-indexed, applying blind feedback is easy. Given the R top-ranked 
documents at global level, the Robertson-Sparck Jones PRF approach [6] is 
applied. In this work, PRF expands the original query with the top ten terms 
obtained from the top ten documents. Then, the expanded query is used to 
reweight every downloaded document. Note that the expanded query is only 
applied at a global level, not at a local level. 
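
As an illustration of the global feedback step, the sketch below selects expansion terms from the top-ranked documents using the Robertson/Sparck Jones relevance weight multiplied by the number of feedback documents containing the term (the usual "offer weight"). It is a hedged sketch, not the paper's code: the function name `expand_query`, the availability of document frequencies for candidate terms in `global_df`, and the fallback used when a frequency is unknown are assumptions.

```python
import math
from collections import Counter

def expand_query(query_terms, top_docs, global_df, n_docs, n_terms=10):
    """Hypothetical PRF sketch: add the n_terms best terms of the top documents.

    top_docs:  list of token lists, the R top-ranked documents at global level
    global_df: {term: document frequency}; assumed to be known for candidates
    n_docs:    total number of documents in the selected collections
    """
    R = len(top_docs)
    # r[t] = number of feedback documents that contain term t.
    r = Counter(t for toks in top_docs for t in set(toks))

    def offer_weight(t):
        # Fallback assumption: a term occurs in at least the r[t] feedback docs.
        n = max(global_df.get(t, r[t]), r[t])
        rest = max(n_docs - n - R + r[t], 0)  # guard against tiny collections
        w = math.log(((r[t] + 0.5) * (rest + 0.5)) /
                     ((n - r[t] + 0.5) * (R - r[t] + 0.5)))
        return r[t] * w

    candidates = [t for t in r if t not in query_terms]
    candidates.sort(key=lambda t: (-offer_weight(t), t))
    return list(query_terms) + candidates[:n_terms]

expanded = expand_query(
    ["government"],
    [["tax", "government", "budget"], ["tax", "government"], ["sports"]],
    {"tax": 5, "budget": 50, "sports": 400},
    n_docs=1000,
    n_terms=2,
)
```

With `n_terms=10` and ten feedback documents this corresponds to the setting described above; the expanded query would then be used to rescore only the documents already downloaded.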



3 Experiments and Results 

3.1 Experimental Methods 

The steps to run each experiment are as follows: 

1. Create one index for each collection (OKAPI or a random index). 

2. Use CORI to score each collection for the query. 

3. When all collections have been scored, choose the most relevant ones using a 
clustering algorithm or by selecting the N best scored. 

4. Run the query using the selected indices. 

5. In some experiments, apply pseudo-relevance feedback in each collection, 
running each expanded query against its index. 

6. Merge the document rankings using CORI or 2-Step RSV. 

7. In some experiments, apply pseudo-relevance feedback again using the 
index created by 2-Step RSV. 

8. Evaluate the quality of the results in terms of precision and recall. The 
precision measures used are the precision at 5, 10, 20 and 100 documents, 
and the average precision at eleven recall points. 
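
The measures in step 8 can be made concrete with a short sketch. It assumes that "average precision at eleven points" means the standard 11-point interpolated average precision (recall levels 0.0, 0.1, ..., 1.0); the function names are illustrative only.

```python
def precision_at(ranking, relevant, k):
    """Fraction of the first k ranked documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k

def eleven_point_average_precision(ranking, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    hits, points = 0, []
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / i))  # (recall, precision)
    levels = [i / 10 for i in range(11)]
    # Interpolated precision at a level = best precision at any recall >= level.
    interpolated = [max((p for r, p in points if r >= level), default=0.0)
                    for level in levels]
    return sum(interpolated) / len(levels)

ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
p_at_2 = precision_at(ranking, relevant, 2)
avg_11 = eleven_point_average_precision(ranking, relevant)
```

In the experiments these values would be averaged over all the queries of a set (51-100 or 101-150).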

The parameters of each experiment are the set of queries used, whether or not 
we have used feedback, and the fusion strategy. Another variable is the weighting 
function used with each index (always OKAPI in the multilingual scenario). This 
means that in some experiments we have used only OKAPI, while in others we 
have used mixed indices, since this scenario gives a more realistic picture of a 
distributed system: two collections that belong to the same distributed system 
need not use the same weighting function. 

3.2 Test Collection Description 

The experiments are carried out using three partitions of the TREC1 and TREC2 
collections and two sets of queries, queries 51-100 and queries 101-150 of 
the TREC collections. The TREC1 and TREC2 collections consist of text 
published between 1987 and 1990 by various newspapers, news agencies and 
publishers. They contain more than two gigabytes of data, divided into about 
740,000 documents. Over these thirteen collections we have built three test 
sets, described in Table 1: 

- TREC-1. The thirteen collections indexed with a single index. This is 
the best case. 

- TREC-13. Each of the thirteen collections has been indexed separately. 

- TREC-80. The original thirteen collections have been divided into eighty 
collections, indexed separately. These eighty collections were created 
as follows: 

1. Each source (AP, DOE, FR, WSJ, ZIFF) has a number of subcollections 
according to its size: AP has 19 subcollections, DOE has 7, 
FR has 18, WSJ has 20 and ZIFF has 16. 

2. Each document has been copied, at random, into one of the 
subcollections of its source. 
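
A minimal sketch of this partitioning procedure follows. The subcollection counts come from the text; the naming scheme (`WSJ-07` and so on), the uniform random choice and the function name are assumptions made for illustration.

```python
import random

# Number of subcollections per source, as given in the text (19+7+18+20+16 = 80).
SUBCOLLECTIONS = {"AP": 19, "DOE": 7, "FR": 18, "WSJ": 20, "ZIFF": 16}

def assign_subcollection(source, rng):
    """Pick, uniformly at random, one subcollection of the document's source."""
    index = rng.randrange(SUBCOLLECTIONS[source]) + 1
    return f"{source}-{index:02d}"

rng = random.Random(0)  # fixed seed so that the partition is reproducible
name = assign_subcollection("WSJ", rng)
```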

The queries used are those of the first two editions of the TREC conferences, 
100 queries in total. In the experiments we have used only the title and 
description fields. 

Table 1. Description of the collection sets 

            # of docs.                       Size (in MB) 
            Min.      Avg.      Max.         Min.    Avg.     Max. 
TREC-1      741,991   741,991   741,991      2168    2168     2168 
TREC-13     10,163    57,066    226,087      33      159.23   260 
TREC-80     2,473     9,273     32,401       23      25.81    30 



3.3 Experiments Description 

Using the sets of indices TREC-1, TREC-13 and TREC-80, we have run several 
experiments, according to the following parameters: 

- Weighting functions used in the indices. To measure the stability of the 
proposed model we have worked with two index sets, one homogeneous and one 
heterogeneous. In the first, all collections have been indexed with the 
OKAPI weighting method. In the second, each collection has been indexed with 
a weighting function chosen at random from those of the ZPRISE system. The 
TREC-1 collection has been indexed using only OKAPI. 

- Collection selection method. In a distributed system it is necessary to score 
each collection for a given query and choose the most relevant ones. CORI 
allows us to score each collection, but does not say which criterion we should 
follow to choose more or fewer collections. [1] suggests using clustering 
methods or setting a fixed number of collections to search. We have applied 
both: clustering and a fixed value N (with N = 5, 10, 15 and 20). 

- Application of local pseudo-relevance feedback. Each librarian (local IR 
system) can expand the query automatically in order to improve its local 
results. Nonetheless, since the added terms are local, they are not known by 
the receptionist (the central system), so both CORI and 2-step RSV lose some 
of the possible local improvement. Even so, it is important to study how this 
popular method affects the final result [5]. The expansion has been 
made by adding to the original query the ten most relevant terms of the first 
ten documents. 

- Application of global pseudo-relevance feedback. Since 2-step RSV generates 
a new index at query time, it is possible to use this new index to apply PRF, 
although it is necessary to download the first N documents, analyze them and 
extract their most discriminating terms. In this work we use the first ten 
documents, so this method adds only a small cost. 

Experiments without Query Expansion and Homogeneous Indices. In 
this section CORI and 2-step RSV are compared. All collections have been 
indexed using the same weighting method, OKAPI. We have applied neither 
local nor global query expansion. 

Tables 2 and 3 show the results obtained. The best results are those of 2-step 
RSV, whose improvement in average precision ranges from 19.4%, with the five 
top-ranked collections of TREC-13, to 107%, with the twenty top-ranked 
collections of TREC-80. In this last case, in absolute terms, 2-step RSV 
improves on CORI by 7.6 points, but CORI starts from a low average precision 
(0.07), so the relative increase looks better than it actually is. CORI obtains 
its best results when the top ten collections are selected. However, its 
performance drops suddenly if the top twenty collections are selected. In this 
respect 2-step RSV is more stable, because the more collections are selected, 
the better the precision. This is shown in figure 1, which plots the 
performance of 2-step RSV relative to CORI. 

Table 2. DIR experiments without feedback and homogeneous indices (set TREC-13, 
queries 101-150) 

Fusion         5-prec   10-prec   20-prec   100-prec   Avg-prec 
Top-10 
CORI           0.484    0.416     0.360     0.268      0.134 
2-Step RSV     0.492    0.478     0.445     0.348      0.199 
Centralized    0.492    0.492     0.444     0.346      0.194 

Table 3. DIR experiments without feedback and homogeneous indices (set TREC-80, 
queries 51-100) 

Fusion         5-prec   10-prec   20-prec   100-prec   Avg-prec 
Top-20 
CORI           0.261    0.263     0.293     0.210      0.071 
2-Step RSV     0.488    0.460     0.443     0.318      0.147 
Centralized    0.556    0.514     0.492     0.371      0.210 
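
The relative and absolute gains quoted for TREC-80 can be checked directly from the average-precision column of Table 3. The helper name below is illustrative; the figures are those of the table.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline

cori_avg, rsv_avg = 0.071, 0.147           # Avg-prec, Top-20, Table 3
gain_percent = relative_gain(cori_avg, rsv_avg)
gain_points = 100.0 * (rsv_avg - cori_avg)  # absolute difference in points
```

The small baseline (0.071) is what turns a difference of 7.6 points into a relative gain of about 107%.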




Fig. 1. Improvement of 2-step RSV over CORI (coll. TREC-80, queries 51-100). The 
base case (100%) shows CORI precision 

Finally, the comparison of CORI and 2-step RSV with a centralized 
model yields an interesting result. The average precision obtained with 2-step 
RSV over 13 subcollections, using Top-10, is better than that obtained with 
the centralized model; this shows that a system that does not use 
all the collections does not always lose precision. The good performance 
obtained with TREC-13 is not matched with TREC-80. In this case the 
precision obtained is lower, although the most probable reason is that the 
proportion of collections used, relative to the total available, is lower. In 
fact, if the top twenty collections are used, which represent 25% of the total, 
the average precision obtained reaches 70% of the precision obtained with a 
centralized model. 

Experiments without Query Expansion and Heterogeneous Indices. 
In this section we study how indexing each subcollection with a randomly 
chosen weighting function affects the results. While in the previous section all 
the subcollections were indexed with the same weighting function, OKAPI, here 
each subcollection has been indexed with a function chosen at random among 
those available in the ZPrise IR system. Tables 4 and 5 show that the results 
follow the same tendency as in the previous section: 

- Using the test set TREC-13, 2-step RSV almost doubles the average precision 
of CORI for all the selection procedures run (Top-5, Top-10 and clustering). 

- Using TREC-80, the difference is smaller, except when we use the 
top twenty collections. In this last case, CORI drops again, and its 
performance is not even a third of that obtained with 2-Step RSV. In the 
other cases the improvement is about 40%. 



Table 4. DIR Experiments without feedback and heterogeneous indices (TREC-13 set, 
queries 101-150) 

Fusion         5-prec   10-prec   20-prec   100-prec   Avg-prec 
Top-10 
CORI           0.348    0.254     0.228     0.208      0.100 
2-Step RSV     0.492    0.480     0.443     0.350      0.198 
Centralized    0.492    0.492     0.444     0.346      0.194 



Table 5. Experiments without feedback and heterogeneous indices (TREC-80 set, 
queries 51-100) 

Fusion         5-prec   10-prec   20-prec   100-prec   Avg-prec 
Top-20 
CORI           0.176    0.214     0.231     0.167      0.046 
2-Step RSV     0.610    0.619     0.569     0.377      0.160 
Centralized    0.556    0.514     0.492     0.371      0.210 

Both 2-step RSV and CORI lose precision when the weighting function of each 
collection is chosen at random. This is almost inevitable, because the 
local results obtained with each collection are in some experiments worse than 
with OKAPI. Nonetheless, the loss of precision is not the same for CORI as 
for 2-step RSV. If we focus on the results obtained over TREC-80, 2-step 
RSV loses in general between 10% and 20%, whereas CORI, with the inclusion 
of various weighting functions, loses more than 40%. This situation is shown 
in figure 2, which plots the ratio of the precision obtained with the random 
functions to the precision obtained with OKAPI. Using clustering and also 
Top-10, 2-step RSV reaches around 80% of the precision obtained with 
homogeneous indices, while CORI reaches only around 60%. Therefore, we can 
affirm that 2-step RSV is stable against variation of the weighting functions 
used in each subcollection. 




Fig. 2. Impact of the use of heterogeneous indices in the performance of CORI and 
2-Step RSV (coll. TREC-80, queries 51-100, clustering). The base case (100%) shows 
the precision obtained with the algorithm using homogeneous indices (only OKAPI) 

3.4 Experiments with Query Expansion 

In this section we study the impact of query expansion methods based 
on local and global PRF: 

- Local pseudo-relevance feedback. Each librarian applies PRF locally with 
the purpose of improving the results obtained over its own library. 

- Global pseudo-relevance feedback. This case can only be applied to 2-step 
RSV. Since the receptionist generates a new global index, it is possible to 
apply PRF over this index. 

- Local and global pseudo-relevance feedback. Finally, it is possible to apply 
PRF first in each librarian and then in the receptionist. 





The Merging Problem in Distributed Information Retrieval 






Local PRF Experiments. As figures 3 and 4 show, local feedback does 
not provide an improvement in the 2-step RSV case. With CORI the situation is 
a little better over 13 collections. Over 80 collections, PRF usually makes 
the results slightly worse, although not always. In every case the differences 
are very small (no more than two points), so we can assume that local PRF 
does not affect the final result, with either CORI or 2-step RSV. 




Fig. 3. Local feedback impact (TREC-13, Top-10) 




Fig. 4. Local feedback impact (TREC-80, clustering) 



In systems that do not collaborate completely (the receptionist cannot access 
the terms added by each librarian) this result was already known for CORI. 
[5] shows that the use of query expansion methods (in their case, 

) does not increase either the collections selection or the documents 
selection. A possible cause is the expanded query length, because this new length 
makes score normalization difficult. This reason does not apply to 2-step 
RSV: some experiments in the last section show that the performance of 
2-step RSV does not change with the length of the query. In the case of 
2-step RSV there may be two causes: 

- The receptionist only works with the original query vocabulary, so the 
documents that become relevant because of the query expansion are not 
selected. 

- 2-step RSV does not use the local score obtained for each document. For 
2-step RSV, the only things that matter about a document are that it 
belongs to the list returned by the librarian and which query terms it 
contains, never its local score. 

Experiments with Global PRF. Whether or not we use local feedback, 
it is possible to apply query expansion methods using not each collection as 
a single unit but the index created in the second phase of the 2-step 
RSV method. The receptionist uses the expanded query, instead of the original 
one, to evaluate every document received from each local IR system. 
The computational cost in this case is not very high compared with that of a 
centralized system, because only a small number of documents need to be 
analyzed (in our experiments, the first ten documents), so in 
general it is only necessary to download two or three documents per selected 
collection, using any procedure like the ones described in section 4. 



Table 6. DIR Experiments with global feedback and homogeneous indices (TREC-13 
set, queries 101-150) 

Fusion                     5-prec   10-prec   20-prec   100-prec   Avg-prec 
Top-10 
CORI                       0.484    0.416     0.360     0.268      0.134 
2-Step RSV                 0.492    0.478     0.445     0.348      0.199 
2-Step RSV + global PRF    0.432    0.424     0.432     0.375      0.232 
Centralized                0.492    0.492     0.444     0.346      0.194 
Centralized + PRF          0.540    0.526     0.497     0.418      0.273 



Table 7. DIR Experiments with global feedback and homogeneous indices (TREC-80 
set, queries 101-150) 

Fusion                     5-prec   10-prec   20-prec   100-prec   Avg-prec 
Top-20 
CORI                       0.274    0.259     0.272     0.192      0.067 
2-Step RSV                 0.488    0.446     0.408     0.289      0.131 
2-Step RSV + global PRF    0.480    0.440     0.431     0.326      0.155 
Centralized                0.492    0.492     0.444     0.346      0.194 
Centralized + PRF          0.540    0.526     0.497     0.418      0.273 

Tables 6 and 7 show the results obtained for the query set 101-150 and 
homogeneous indices. 

The increase in average precision obtained with global PRF, relative to the 
original 2-step RSV, is quite significant, around 15-20%. With TREC-13, the 
lowest increase is 16.6% (Top-10) and the highest is 21% (clustering). 
Likewise, the results of the experiments with TREC-80 are between 
17.4% (Top-5) and 20% (Top-20). The increase of the centralized model with 
PRF is 41%, much more than the 20% obtained with 2-step RSV. The reason is 
clear: the centralized model has access to all the documents of all the 
collections, so it can include relevant documents which were not selected 
previously. This is impossible with 2-step RSV, because this method only 
considers the selected documents and collections: it re-sorts documents 
selected previously, but never adds new documents. In any case, this situation 
could change if we ran the expanded query over each selected collection and 
then applied the original 2-step RSV algorithm to these new results. 

It is clear that global PRF increases the average precision, but it does not 
always improve the ranking of the first documents: precision is frequently 
worse with global PRF than without PRF when only the first five or ten 
documents are considered. This is shown in figure 5. 




Fig. 5. Global feedback impact (random indices, clustering) (I) 



The use of global PRF increases recall in general, because it brings more 
relevant documents into the first thousand, but it does not increase the 
precision of the first documents. Is it always advisable to use global PRF? As 
is usual in these cases, the answer depends on the user’s needs. In general, 
applying PRF slightly worsens the ranking of the first 10 or 15 documents, 
and improves the results beyond that point. On the other hand, the 
computational cost of applying PRF is moderate but not negligible, because the 
first documents must be analyzed at query time. This computational cost may 
be what decides whether or not to apply the method: slightly better results in 
general, but the user has to wait a few more seconds. 

4 Conclusions and Future Work 

This paper shows the application of 2-step RSV to DIR environments. We have 
focused on two questions: collections with different weighting functions, and 
the improvement brought by blind feedback applied by the DIR monitor at global 
level. 

1. The experiments on the first question show that 2-step RSV is robust against 
variations of the weighting function. Both 2-step RSV and CORI lose precision 
when the weighting function of each collection is chosen at random, but the 
loss of precision with CORI is more than 40%, while with 2-step RSV it is 
between 10% and 20%. 

2. Blind feedback is a useful technique when it is applied at global level 
rather than individually for each IR system. Global PRF applied with 2-step 
RSV increases the average precision, but it does not always improve the 
ranking of the first documents: precision is frequently worse with global 
PRF than without PRF when only the first five or ten documents are 
considered. 

We can also test whether the results with global PRF improve when the 
expanded query is sent to each collection, instead of only recalculating, by 
means of the original query, the score of the documents already received. The 
steps of the proposed architecture would be modified as follows: 

1. The receptionist receives the user query and sends it to the selected 
collections. 

2. The selected collections send a few documents to the receptionist. The 
receptionist uses these documents to expand the initial user query. 

3. The expanded query is sent to the selected collections. The documents 
received initially are discarded. 

4. The score of the documents received from the collections is recalculated 
using the global index generated by the 2-step RSV method and the expanded 
query. 

5. Finally, the receptionist ranks the received documents. 

We hope that this approach will achieve an improvement similar to the one 
obtained using PRF in a centralized IR system. This additional step will have a 
computational cost, which will also be studied. 

Abstract. A simple, robust sliding-window part-of-speech tagger is 
presented and a method is given to estimate its parameters from an 
untagged corpus. Its performance is compared to a standard 
Baum-Welch-trained hidden-Markov-model part-of-speech tagger. Transformation 
into a finite-state machine, behaving exactly as the tagger itself, is 
demonstrated. 



1 Introduction 

A large fraction (typically 30%, but varying from one language to another) of 
the words in natural language texts are words that, in isolation, may be as- 
signed more than one morphological analysis and, in particular, more than one 
part of speech (PoS). The correct resolution of this kind of ambiguity for each 
occurrence of the word in the text is crucial in many natural language process- 
ing applications; for example, in machine translation, the correct equivalent of 
a word may be very different depending on its part of speech. 

This paper presents a sliding-window PoS tagger (SWPoST), that is, a system 
which assigns the part of speech of a word based on the information provided by 
a fixed window of words around it. The SWPoST idea is not new; however, we are 
not aware of any SWPoST which, using reasonable approximations, may easily 
be trained in an unsupervised manner; that is, avoiding costly manual tagging 
of a corpus. Furthermore, as with any fixed-window SWPoST, and in contrast 
with more customary approaches such as hidden Markov models (HMM), the 
tagger may be implemented exactly as a finite-state machine (a Mealy machine). 

The paper is organized as follows: section 2 gives some definitions, describes 
the notation that will be used throughout the paper, and compares the 
sliding-window approach to the customary (HMM) approach to part-of-speech 
tagging; section 3 describes the approximations that allow a SWPoST to be 
trained in an unsupervised manner and describes the training process itself; 
section 4 describes how the tagger may be used on new text and how it may be 
turned into a finite-state tagger; section 5 describes a series of experiments 
performed to compare the performance of a SWPoST to that of an HMM tagger 
and to explore the size of the resulting finite-state taggers (after 
minimization); and, finally, concluding remarks are given in section 6. 



2 Preliminaries 

Let $\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_{|\Gamma|}\}$ be the tag set 
for the task, that is, the set of PoS tags a word may receive, and 
$W = \{w_1, w_2, \ldots\}$ be the vocabulary of the task. A partition of $W$ is 
established so that $w_i \equiv w_j$ if and only if both are assigned the same 
subset of tags. Each of the classes of this partition is usually called an 
ambiguity class. It is usual [1] to refine this partition so that, for 
high-frequency words, each word class contains just one word whereas, for 
lower-frequency words, word classes are made to correspond exactly to 
ambiguity classes (although it is also possible to use one-word classes for all 
words or to use only ambiguity classes), which allows for improved performance 
on very frequent ambiguous words while keeping the number of parameters of the 
tagger under control. 

Any such refinement will be denoted as 
$\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|}\}$, where the 
$\sigma_i$ are word classes. In this paper, word classes will simply be 
ambiguity classes, without any refinement. We will call 
$T : \Sigma \rightarrow 2^{\Gamma}$ the function returning the set $T(\sigma)$ 
of PoS tags for each word class $\sigma$. 
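
As a small illustration of these definitions, the sketch below builds ambiguity classes from a toy lexicon. Representing a word class $\sigma$ directly by the frozen set of its tags (so that $T(\sigma)$ is the set itself) is an assumption of the example, as are the toy lexicon and the function name.

```python
from collections import defaultdict

def ambiguity_classes(lexicon):
    """Group words by the set of tags they may receive.

    lexicon: {word: iterable of tags}
    Returns (word_class, classes): the class of each word, and the sorted
    list of words belonging to each class.
    """
    word_class = {w: frozenset(tags) for w, tags in lexicon.items()}
    classes = defaultdict(list)
    for w, sigma in word_class.items():
        classes[sigma].append(w)
    return word_class, {sigma: sorted(ws) for sigma, ws in classes.items()}

toy_lexicon = {
    "the": {"DET"},
    "book": {"NOUN", "VERB"},
    "table": {"NOUN", "VERB"},
    "run": {"VERB", "NOUN"},
}
word_class, classes = ambiguity_classes(toy_lexicon)
```

Here "book", "table" and "run" fall into the same class {NOUN, VERB}, so the tagger treats them identically; the refinement mentioned above would instead give frequent words a class of their own.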

The part-of-speech tagging problem may be formulated as follows: given a 
text $w[1]w[2] \ldots w[L] \in W^{+}$, each word $w[t]$ is assigned (using a 
lexicon, a morphological analyser, or a guesser) a word class 
$\sigma[t] \in \Sigma$ to obtain an ambiguously tagged text 
$\sigma[1]\sigma[2] \ldots \sigma[L] \in \Sigma^{+}$; the task of the PoS tagger 
is to obtain a tagged text $\gamma[1]\gamma[2] \ldots \gamma[L] \in \Gamma^{+}$ 
(with $\gamma[t] \in T(\sigma[t])$) as correct as possible. 

Statistical part-of-speech tagging looks for the most likely tagging of an 
ambiguously tagged text $\sigma[1]\sigma[2] \ldots \sigma[L]$: 

$$\gamma^{*}[1] \ldots \gamma^{*}[L] = \operatorname*{argmax}_{\gamma[t] \in T(\sigma[t])} P(\gamma[1] \ldots \gamma[L] \mid \sigma[1] \ldots \sigma[L]) \qquad (1)$$

which, using Bayes’ formula, becomes equivalent to: 

$$\gamma^{*}[1] \ldots \gamma^{*}[L] = \operatorname*{argmax}_{\gamma[t] \in T(\sigma[t])} P_S(\gamma[1] \ldots \gamma[L]) \, P_L(\sigma[1] \ldots \sigma[L] \mid \gamma[1] \ldots \gamma[L]) \qquad (2)$$

where $P_S(\gamma[1] \ldots \gamma[L])$ is the probability of a particular 
tagging (syntactical probability) and 
$P_L(\sigma[1] \ldots \sigma[L] \mid \gamma[1] \ldots \gamma[L])$ is the 
probability of that particular tagging generating the text 
$\sigma[1] \ldots \sigma[L]$ (lexical probability). In hidden Markov 
models (HMM) [5], these probabilities are approximated as products; the 
syntactical probabilities are modeled by a first-order Markov process: 

$$P_S(\gamma[1]\gamma[2] \ldots \gamma[L]) = \prod_{t=0}^{t=L} p_S(\gamma[t+1] \mid \gamma[t]) \qquad (3)$$

where $\gamma[0]$ and $\gamma[L+1]$ are fixed delimiting tags (which we will 
denote as $\gamma_{\#}$ and will usually correspond to sentence boundaries); 
lexical probabilities are made independent of context: 

$$P_L(\sigma[1]\sigma[2] \ldots \sigma[L] \mid \gamma[1]\gamma[2] \ldots \gamma[L]) = \prod_{t=1}^{t=L} p_L(\sigma[t] \mid \gamma[t]). \qquad (4)$$

The number of trainable parameters for such a tagger is 
$(|\Gamma| + |\Sigma|)|\Gamma|$. Tagging (searching for the optimal 
$\gamma^{*}[1]\gamma^{*}[2] \ldots \gamma^{*}[L]$) is implemented using a very 
efficient, left-to-right algorithm usually known as Viterbi’s algorithm [1,5]. If 
conveniently implemented, Viterbi’s algorithm can output a partial tagging each 
time a nonambiguous word is seen, but maintains multiple hypotheses when 
reading ambiguous words. HMMs may be trained either from tagged text (simply 
by counting and taking probabilities to be equal to frequencies) or from 
untagged text, using the well-known expectation-maximization backward-forward 
Baum-Welch algorithm [5,1]. 

In this paper we look at tagging from a completely different perspective. 
Instead of using the inverted formulation in eq. (2) we approximate the 
probability in eq. (1) directly as: 

$$P(\gamma[1]\gamma[2] \ldots \gamma[L] \mid \sigma[1]\sigma[2] \ldots \sigma[L]) = \prod_{t=1}^{t=L} p(\gamma[t] \mid C_{(-)}[t]\,\sigma[t]\,C_{(+)}[t]) \qquad (5)$$

where 

$$C_{(-)}[t] = \sigma[t - N_{(-)}]\,\sigma[t - N_{(-)} + 1] \cdots \sigma[t - 1]$$

is a preceding context of size $N_{(-)}$, 

$$C_{(+)}[t] = \sigma[t + 1]\,\sigma[t + 2] \cdots \sigma[t + N_{(+)}]$$

is a succeeding context of size $N_{(+)}$, and 
$\sigma[-N_{(-)} + 1], \sigma[-N_{(-)} + 2], \ldots, \sigma[0]$ and 
$\sigma[L + 1], \sigma[L + 2], \ldots, \sigma[L + N_{(+)}]$ are all set to a 
special delimiting word class $\sigma_{\#}$ such that 
$T(\sigma_{\#}) = \{\gamma_{\#}\}$, e.g., one containing the sentence-boundary 
marker tag $\gamma_{\#} \in \Gamma$. This method is local in nature; it does not 
consider any context beyond the window of $N_{(-)} + N_{(+)} + 1$ words; its 
implementation is straightforward, even more so than that of Viterbi’s 
algorithm. The main problem is the estimation of the probabilities 
$p(\gamma[t] \mid C_{(-)}[t]\,\sigma[t]\,C_{(+)}[t])$. From a tagged corpus, 
these probabilities may be easily counted; in this paper, however, we propose a 
way of estimating them from an untagged corpus. Another problem is the large 
number of parameters of the model (worst case 
$O(|\Sigma|^{N_{(+)} + N_{(-)} + 1}|\Gamma|)$); we will discuss a way to reduce 
the number of parameters to just $O(|\Sigma|^{N_{(+)} + N_{(-)}}|\Gamma|)$ 
and show that, for many applications, $N_{(-)} = N_{(+)} = 1$ is an adequate 
choice. 
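
To make the window definition concrete, the following sketch enumerates, for each position of an ambiguously tagged text, the triple made of the preceding context, the current word class and the succeeding context, padding both ends with the delimiting class. Word classes are represented as frozen sets of tags; the names `BOUNDARY` and `windows` and the tag label `"#"` are illustrative assumptions.

```python
BOUNDARY = frozenset({"#"})  # delimiting word class whose only tag is the boundary tag

def windows(classes, n_left=1, n_right=1):
    """Yield (C_minus, sigma, C_plus) for every position of the class sequence."""
    padded = [BOUNDARY] * n_left + list(classes) + [BOUNDARY] * n_right
    for t in range(n_left, n_left + len(classes)):
        yield (tuple(padded[t - n_left:t]),
               padded[t],
               tuple(padded[t + 1:t + 1 + n_right]))

DET = frozenset({"DET"})
NV = frozenset({"NOUN", "VERB"})
triples = list(windows([DET, NV]))
```

Counting how often each triple occurs in an untagged corpus gives exactly the statistics that the window model needs, and each window spans `n_left + n_right + 1` word classes.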



3 Training from an Untagged Corpus 

Unsupervised Training of a Finite-State Sliding-Window 

The main approximation in this model is the following: we will assume that the 
probability of finding a certain tag $\gamma[t]$ in the center of the window 
depends only on the preceding context $C_{(-)}[t]$ and the succeeding context 
$C_{(+)}[t]$ but not on the particular word class at position $t$, $\sigma[t]$; 
that is, the probability that a word receives a certain label depends only on 
the context and not on the word (the context determines the probabilities of 
each label, whereas the word just selects labels among those in 
$T(\sigma[t])$). We will denote this probability as 
$p_{C_{(-)}\gamma C_{(+)}}$ for short (with the position index $[t]$ dropped 
because of the invariance). The most probable tag $\gamma^{*}[t]$ is then 



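$$\gamma^{*}[t] = \operatorname*{argmax}_{\gamma \in T(\sigma[t])} p_{C_{(-)}[t]\,\gamma\,C_{(+)}[t]}, \qquad (6)$$

that is, the most probable tag in that context among those corresponding to 
the current word class. The probabilities $p_{C_{(-)}\gamma C_{(+)}}$ are easily 
estimated from a tagged corpus (e.g., by counting) but estimating them from an 
untagged corpus involves an iterative process; instead of estimating the 
probability we will estimate the count $\tilde{n}_{C_{(-)}\gamma C_{(+)}}$, 
which can be interpreted as the effective number of times that label $\gamma$ 
would appear in the text between contexts $C_{(-)}$ and $C_{(+)}$. Therefore, 

$$p(\gamma \mid C_{(-)}\sigma C_{(+)}) = \begin{cases} k_{C_{(-)}\sigma C_{(+)}}\,\tilde{n}_{C_{(-)}\gamma C_{(+)}} & \text{if } \gamma \in T(\sigma) \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where $k_{C_{(-)}\sigma C_{(+)}}$ is a normalization factor 

$$k_{C_{(-)}\sigma C_{(+)}} = \left( \sum_{\gamma' \in T(\sigma)} \tilde{n}_{C_{(-)}\gamma' C_{(+)}} \right)^{-1}. \qquad (8)$$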



Now, how can the counts $\tilde{n}_{C_{(-)}\gamma C_{(+)}}$ be estimated? If the window
probabilities $p(\gamma \mid C_{(-)}\sigma C_{(+)})$ were known, they could be easily obtained from the
text itself as follows:

$$\tilde{n}_{C_{(-)}\gamma C_{(+)}} = \sum_{\sigma:\,\gamma \in T(\sigma)} n_{C_{(-)}\sigma C_{(+)}}\; p(\gamma \mid C_{(-)}\sigma C_{(+)}), \qquad (9)$$



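where $n_{C_{(-)}\sigma C_{(+)}}$ is the number of times that word class $\sigma$ appears between contexts
$C_{(-)}$ and $C_{(+)}$; that is, one would add $p(\gamma \mid C_{(-)}\sigma C_{(+)})$ each time a word class $\sigma$
containing tag $\gamma$ appears between $C_{(-)}$ and $C_{(+)}$. Equations (7) and (9) may be
iteratively solved until the $\tilde{n}_{C_{(-)}\gamma C_{(+)}}$ converge. For the computation to be more
efficient, one can avoid storing the probabilities $p(\gamma \mid C_{(-)}\sigma C_{(+)})$ by organizing
the iterations around the $\tilde{n}_{C_{(-)}\gamma C_{(+)}}$ as follows, by combining eqs. (7), (8), and
(9) and using an iteration index denoted with a superscript $[k]$,

$$\tilde{n}^{[k]}_{C_{(-)}\gamma C_{(+)}} = \tilde{n}^{[k-1]}_{C_{(-)}\gamma C_{(+)}} \sum_{\sigma:\,\gamma \in T(\sigma)} n_{C_{(-)}\sigma C_{(+)}} \Bigl(\sum_{\gamma' \in T(\sigma)} \tilde{n}^{[k-1]}_{C_{(-)}\gamma' C_{(+)}}\Bigr)^{-1}, \qquad (10)$$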



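where the iteration may be easily seen as a process of successive corrections to
the effective counts $\tilde{n}_{C_{(-)}\gamma C_{(+)}}$. A possible initial value is given by

$$\tilde{n}^{[0]}_{C_{(-)}\gamma C_{(+)}} = \sum_{\sigma:\,\gamma \in T(\sigma)} n_{C_{(-)}\sigma C_{(+)}}\, \frac{1}{|T(\sigma)|}, \qquad (11)$$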







that is, assuming that, initially, all possible tags are equally probable for each 
word class. 

Equations (10) and (11) contain the counts $n_{C_{(-)}\sigma C_{(+)}}$, which depend on
$N_{(+)} + N_{(-)} + 1$ word classes; if memory is at a premium, instead of reading
the text once to count these and then iterating, the text may be read in each
iteration to avoid storing the $n_{C_{(-)}\sigma C_{(+)}}$, and the $\tilde{n}^{[k]}_{C_{(-)}\gamma C_{(+)}}$ computed on the
fly. Iterations proceed until a selected convergence condition has been
met (e.g., a comparison of the $\tilde{n}^{[k]}_{C_{(-)}\gamma C_{(+)}}$ with respect to the $\tilde{n}^{[k-1]}_{C_{(-)}\gamma C_{(+)}}$, or the
completion of a predetermined number of iterations).
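The iteration defined by eqs. (10) and (11) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `train_effective_counts`, the dictionary layout, and the fixed number of iterations used as the stopping rule are assumptions made here, and every window count is assumed to be positive.

```python
from collections import defaultdict


def train_effective_counts(window_counts, tags_of, iterations=4):
    """Estimate effective counts keyed by (left context, tag, right context).

    window_counts: dict mapping (left, sigma, right) to the number of times
        word class sigma was seen between contexts left and right.
    tags_of: dict mapping each word class sigma to its tag set T(sigma).
    """
    # Eq. (11): spread every observation evenly over the tags of its class.
    eff = defaultdict(float)
    for (left, sigma, right), n in window_counts.items():
        for tag in tags_of[sigma]:
            eff[(left, tag, right)] += n / len(tags_of[sigma])

    # Eq. (10): redistribute each observation in proportion to the current
    # effective counts of the tags allowed by its word class.
    for _ in range(iterations):
        new = defaultdict(float)
        for (left, sigma, right), n in window_counts.items():
            norm = sum(eff[(left, g, right)] for g in tags_of[sigma])
            for tag in tags_of[sigma]:
                new[(left, tag, right)] += n * eff[(left, tag, right)] / norm
        eff = new
    return dict(eff)
```

Each pass preserves the total mass of a context pair, so the effective counts for a given pair of contexts always add up to the number of windows observed with those contexts.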



4 Tagging Text: A Finite-State Tagger 

Once the $\tilde{n}_{C_{(-)}\gamma C_{(+)}}$ have been computed, the winning tag for class $\sigma$ in context
$C_{(-)} \cdots C_{(+)}$, eq. (6), may be easily computed for all of the words in a text.
Unlike with HMM [2], a sliding-window PoS tagger may be turned exactly into
a finite-state transducer [6]; in particular, into a Mealy machine with transitions
having the form

$$\sigma[t-N_{(-)}]\,\sigma[t-N_{(-)}+1] \cdots \sigma[t+N_{(+)}-1] \;\xrightarrow{\;\sigma[t+N_{(+)}]\,:\,\gamma^*[t]\;}\; \sigma[t-N_{(-)}+1]\,\sigma[t-N_{(-)}+2] \cdots \sigma[t+N_{(+)}].$$

This Mealy machine reads a text $\sigma[1] \ldots \sigma[L]$ word by word and outputs
the winner tag sequence $\gamma^*[1] \ldots \gamma^*[L]$ with a delay of $N_{(+)}$ words. The resulting
transducer has, in the worst case, $O(|\Sigma|^{N_{(+)}+N_{(-)}})$ states and $O(|\Sigma|^{N_{(+)}+N_{(-)}+1})$
transitions, but it may be minimized using traditional methods for finite-state
transducers into a compact version of the sliding-window PoS tagger, which takes
into account the fact that different contexts may actually be grouped because
they lead to the same disambiguation results.
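Before any compilation into a transducer, the decision rule of eq. (6) can be applied directly with a sliding window. The sketch below is a hypothetical illustration: the function name `tag_sequence`, the boundary symbol `pad`, and the representation of contexts as tuples of word classes are assumptions of this example, and ties are broken alphabetically for determinism.

```python
def tag_sequence(classes, tags_of, eff, n_left=1, n_right=1, pad="#"):
    """Return the winning tag for each word class in `classes` (eq. 6).

    eff maps (left context tuple, tag, right context tuple) to an
    effective count; contexts never seen in training count as zero.
    """
    padded = [pad] * n_left + list(classes) + [pad] * n_right
    output = []
    for t, sigma in enumerate(classes):
        centre = t + n_left
        left = tuple(padded[centre - n_left:centre])
        right = tuple(padded[centre + 1:centre + 1 + n_right])
        # Choose only among the tags allowed by the current word class.
        best = max(sorted(tags_of[sigma]),
                   key=lambda g: eff.get((left, g, right), 0.0))
        output.append(best)
    return output
```

A Mealy machine built as described above would store exactly these decisions on its transitions, so that tagging needs one table lookup per word.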



5 Experiments 

This section reports experiments to assess the performance of sliding-window
part-of-speech taggers using different amounts of context, compares it with that
of customary Baum-Welch-trained HMM taggers [1], and describes the conversion of
the resulting SWPoST taggers into finite-state machines.

The corpus we have used for training and testing is the Penn Treebank,
version 3 [4,3], which has more than one million PoS-tagged words (1014377)
of English text. The word classes $\Sigma$ of the Treebank will be taken simply to be
ambiguity classes, that is, subsets of the collection of different part-of-speech
tags ($\Gamma$). The Treebank has 45 different part-of-speech tags; 244261 words are
ambiguous (24.08%).

The experiments use a lexicon extracted from the Penn Treebank, that is,
a list of words with all the possible parts of speech observed. The exact tag
given in the Treebank for each occurrence of each word is taken into account
for testing but not for training. However, to simulate the effect of using a real,
limited morphological analyser, we have filtered the resulting lexicon as follows:

- only the 14276 most frequent words have been kept, which ensures a 95%
text coverage (i.e., 5% of the words are unknown).

- for each word, any part-of-speech tag occurring less than 5% of the time has
been removed.

Using this simplified lexicon, texts in the Penn Treebank show 219 ambiguity
classes. Words which are not included in our lexicon are assigned to a special
ambiguity class (the open class) containing all tags representing parts of speech
that can grow (i.e., a new word can be a noun or a verb but hardly ever a
preposition).¹
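The two filtering steps can be illustrated with a short Python sketch. It is a hypothetical reading of the procedure rather than the authors' code: the corpus is assumed to be a list of (word, tag) pairs, and the names `build_lexicon` and `ambiguity_class` are introduced here only for illustration.

```python
from collections import Counter, defaultdict


def build_lexicon(tagged_words, max_words, min_share=0.05):
    """Map each kept word to its ambiguity class (a frozenset of tags)."""
    word_freq = Counter(word for word, _ in tagged_words)
    tag_freq = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_freq[word][tag] += 1

    lexicon = {}
    # Keep only the most frequent words.
    for word, total in word_freq.most_common(max_words):
        # Drop tags seen in fewer than min_share of the word's occurrences.
        kept = {t for t, n in tag_freq[word].items() if n / total >= min_share}
        lexicon[word] = frozenset(kept)
    return lexicon


def ambiguity_class(word, lexicon, open_class):
    """Words missing from the lexicon fall back to the open class."""
    return lexicon.get(word, open_class)
```

With `max_words=14276` and `min_share=0.05` this would mirror the setting described above; the tags are used only to build the lexicon, never to train the tagger.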

In order to train the taggers we have applied the following strategy, so that
we can use as much text as possible for training: the Treebank is divided into 20
similarly-sized sections; a leaving-one-out procedure is applied, using 19 sections
for training and the remaining one for testing, so that our results are the average
of all 20 different train-test configurations. Figures show the average correct-tag
rate only over ambiguous words (non-ambiguous words are not counted as
successful disambiguations).
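The leaving-one-out protocol can be sketched as follows. The helper name `leave_one_out_splits` and the way the corpus is cut into contiguous, similarly-sized sections are assumptions of this sketch; the text does not say how the 20 sections were actually delimited.

```python
def leave_one_out_splits(items, n_sections=20):
    """Yield (train, test) pairs, holding out one section at a time."""
    size, extra = divmod(len(items), n_sections)
    sections, start = [], 0
    for i in range(n_sections):
        # The first `extra` sections get one additional item.
        end = start + size + (1 if i < extra else 0)
        sections.append(items[start:end])
        start = end
    for i, test in enumerate(sections):
        train = [x for j, s in enumerate(sections) if j != i for x in s]
        yield train, test
```

The reported figure would then be the mean, over the 20 splits, of the correct-tag rate computed on the ambiguous words of each held-out section.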



5.1 Effect of the Amount of Context 

First of all, we show the results of a SWPoST using no context ($N_{(-)} = N_{(+)} = 0$)
as a baseline, and compare them to those of a Baum-Welch-trained HMM tagger
and to random tagging. As can be seen in figure 1, the performance of the
SWPoST without context is not much better than random tagging. This happens
because without context the SWPoST simply delivers an estimate of the most
likely tag in each class. We can also observe that the HMM has an accuracy of
around 61% on ambiguous words.

In order to improve the results one obviously needs to increase the context
(i.e., widen the sliding window). As a first step we show the results of using a
reduced context of only one word either before ($N_{(-)} = 1$, $N_{(+)} = 0$) or after
($N_{(-)} = 0$, $N_{(+)} = 1$) the current word. Figure 2 shows that even with
such a limited context the performance is more adequate. The number of
trainable parameters of the SWPoST in this case is $O(|\Sigma||\Gamma|)$, slightly less than the
$O(|\Sigma||\Gamma| + |\Gamma|^2)$ of the HMM tagger.

There is a significant difference between using as context the preceding ($N_{(-)} = 1$
and $N_{(+)} = 0$) or the succeeding ($N_{(-)} = 0$ and $N_{(+)} = 1$) word. The cognitive
origin of this difference could be that, when people process language, they
tend to build hypotheses based on what they have already heard or read, which
are used to reduce the ambiguity of words as they arrive.



¹ Our open class contains the tags CD, JJ, JJR, JJS, NN, NNP, NNPS, RB, RBR, RBS, UH,
VB, VBD, VBG, VBN, VBP, and VBZ.




Fig. 1. Comparison between an HMM tagger, the SWPoST with no context
($N_{(-)} = N_{(+)} = 0$) and a random tagger (vertical axis: accuracy, 100 - word error rate)

Fig. 2. Comparison between an HMM tagger and the SWPoST using $N_{(-)} = 1$ and
$N_{(+)} = 0$ (preceding) and using $N_{(-)} = 0$ and $N_{(+)} = 1$ (succeeding) (vertical axis:
accuracy, 100 - word error rate)

The next step is to increase the size of the context to two context words.
In this case we have three different possibilities: using the two
immediately preceding words ($N_{(-)} = 2$ and $N_{(+)} = 0$), using one preceding and
one succeeding word ($N_{(-)} = 1$ and $N_{(+)} = 1$), and using two succeeding words
($N_{(-)} = 0$ and $N_{(+)} = 2$). We can see the results in figure 3. The performance of
the SWPoST is now much better than that of the HMM tagger when using a context of
$N_{(-)} = 1$ and $N_{(+)} = 1$. However, when using the two succeeding words the results
are worse than with the HMM tagger, and the performance of the SWPoST with
$N_{(-)} = 2$ and $N_{(+)} = 0$ is about as good as that of an HMM tagger.




Fig. 3. Comparison between an HMM tagger and the SWPoST using $N_{(-)} = 2$ and
$N_{(+)} = 0$ (2 preceding), using $N_{(-)} = 1$ and $N_{(+)} = 1$ (1 prec. and 1 suc.), and using
$N_{(-)} = 0$ and $N_{(+)} = 2$ (2 succeeding)



Finally, we tried increasing the context further, to three context
words in all possible geometries, but the results were not as good as we expected
(actually worse) because the corpus is not large enough to allow the
estimation of $O(|\Sigma|^3|\Gamma|)$ parameters.

The whole set of figures shows that SWPoST training usually converges after
three or four iterations, which makes training very efficient in terms of time.

5.2 Finite-State Sliding-Window PoS Tagger 

Once we have analysed the performance of the SWPoST we study its
transformation into an equivalent finite-state transducer (FST). Given that the best
results reported in the previous section correspond to using the context $N_{(-)} = 1$
and $N_{(+)} = 1$, we build a FST that has a decision delay of 1 time unit, with
transitions of the form

$$\sigma[t-1]\,\sigma[t] \;\xrightarrow{\;\sigma[t+1]\,:\,\gamma^*[t]\;}\; \sigma[t]\,\sigma[t+1].$$

Many of these transitions correspond to contexts that have never or hardly
ever been observed in the corpus. To improve the accuracy of the tagger, a
fallback strategy was applied; this strategy uses SWPoST with smaller contexts
trained on the same corpus to define the output of these unseen transitions.
Figure 4 shows the order of preference of the fallback strategy: if $N_{(-)} = 1$, $N_{(+)} = 1$
fails, the next best tagger $N_{(-)} = 1$, $N_{(+)} = 0$ is used; if this fails, $N_{(-)} = 0$,
$N_{(+)} = 1$ is used, etc. The resulting FST has a slightly improved performance,
reaching 67.15% accuracy for ambiguous words.

Fig. 4. Fallback strategy, in order of preference: $N_{(-)} = N_{(+)} = 1$; $N_{(-)} = 1$, $N_{(+)} = 0$;
$N_{(-)} = 0$, $N_{(+)} = 1$; $N_{(-)} = N_{(+)} = 0$
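The fallback order can be expressed as a simple cascade. The following Python sketch is hypothetical: the text does not specify when a tagger "fails", so here a model is assumed to fail when it holds no effective count at all for the given context, and the function name `tag_with_fallback` and the model layout are invented for illustration.

```python
def tag_with_fallback(left, sigma, right, tags_of, models):
    """Try models from widest to narrowest context until one has evidence.

    models: list of (use_left, use_right, eff) in order of preference, where
        eff maps (left context, tag, right context) to an effective count
        and an unused context is stored as None.
    """
    for use_left, use_right, eff in models:
        left_ctx = left if use_left else None
        right_ctx = right if use_right else None
        scores = {g: eff.get((left_ctx, g, right_ctx), 0.0)
                  for g in tags_of[sigma]}
        if any(scores.values()):  # this context was observed in training
            return max(sorted(scores), key=scores.get)
    return min(tags_of[sigma])  # arbitrary but deterministic last resort
```

Running this cascade once for every transition while the FST is being built means that the fallback costs nothing at tagging time.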

5.3 Minimization of the Finite-State SWPoST 

The FST created in this way has a large number of states; customary finite-state
minimization may be expected to reduce the number of states and therefore
reduce memory requirements. The algorithm to build the FST generates in our
case 48400 states ($|\Sigma|^2$) and 10648000 ($|\Sigma|^3$) transitions. After minimization the
FST is reduced to 22137 states and 4870140 transitions. Given the large number
of ambiguity classes, minimizing to about half the size is not far from what we
expected.
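The sizes quoted in this paper can be checked with a little arithmetic. The snippet below assumes $|\Sigma| = 220$ word classes (the 219 ambiguity classes plus the open class), which is an inference from the reported 48400 states rather than a figure stated explicitly; $|\Gamma| = 45$ is the reported number of tags.

```python
n_classes = 220  # assumed |Sigma|: 219 ambiguity classes + 1 open class
n_tags = 45      # |Gamma|, as reported for the Penn Treebank

hmm_params = n_classes * n_tags + n_tags ** 2   # |Sigma||Gamma| + |Gamma|^2
swpost_one_word = n_classes * n_tags            # one context word
swpost_two_words = n_classes ** 2 * n_tags      # N(-) = N(+) = 1
fst_states = n_classes ** 2
fst_transitions = n_classes ** 3

print(hmm_params, swpost_one_word, swpost_two_words, fst_states, fst_transitions)
# prints: 11925 9900 2178000 48400 10648000

# The minimized FST keeps one outgoing transition per word class and state.
print(22137 * n_classes)  # prints: 4870140
```

Under this assumption the numbers agree with the state, transition and parameter counts reported in the text, and the minimized machine retains about 46% of the original states.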



6 Concluding Remarks 

We have shown that, like commonly used HMM taggers, simple and intuitive
sliding-window part-of-speech taggers (SWPoST) may be iteratively trained in
an unsupervised manner using reasonable approximations to reduce the number
of trainable parameters. The number of trainable parameters depends on the
size of the sliding window. Experimental results with the Penn Treebank show
that the performance of SWPoST and HMM taggers having a similar number of
trainable parameters is comparable. The best results, better than those of HMM
taggers, are obtained with a SWPoST using a context of one preceding and one
succeeding word, for a worst-case total of 2178000 parameters (with the HMM
tagger having only 11925). The SWPoST can be exactly implemented as a
finite-state transducer which, after minimization, has 22137 states and 4870140
transitions. Furthermore, the functioning of SWPoST is simple and intuitive, which
allows for simple implementation and maintenance; for instance, if a training
error is found it is easy to manually correct a transition in the resulting FST.

We are currently studying ways to further reduce the number of states and
transitions at a small price in tagging accuracy, by using probabilistic criteria to
prune uncommon contexts which do not contribute significantly to the overall
accuracy.

Abstract. In this paper, we focus on text categorization by unsupervised
learning techniques that do not require labeled data. We propose a feature
learning bootstrapping algorithm (FLB) that, starting from a small number of seed
words, automatically learns features for each category from a large
amount of unlabeled documents. Using these learned features we develop a new
Naive Bayes classifier named NB_FLB. Experimental results show that the
NB_FLB classifier performs better than other Naive Bayes classifiers trained by
supervised learning when the number of features is small.



1 Introduction 

A growing number of statistical classification methods and machine learning
techniques have been applied to text categorization in recent years, such as Rocchio [1,2],
SVM [3], Decision Tree [4], the Maximum Entropy model [5], and Naive Bayes [6]. Most
categorization systems make use of a training collection of documents to predict the
assignment of new documents to pre-defined categories. However, when applied to complex
domains with many classes, these methods often require an extremely large training
collection to achieve good accuracy. Creating such labeled data is tedious
and expensive, because the documents must be labeled by hand. In other words,
labeled data is difficult to obtain, whereas unlabeled data is readily available
and plentiful.

In this paper, we mainly focus on using unsupervised learning techniques in text
categorization that do not require labeled data. This work is organized as follows. In
section 2, we introduce the scheme of the unsupervised text categorization classifier
and propose a feature learning algorithm based on bootstrapping. In section 3,
using these learned features, we develop a new Naive Bayes classifier and present the
evaluation environment and experiments.



2 Unsupervised Text Categorization Scheme 

In the machine learning literature, a large number of clustering algorithms have been
extensively studied to handle the classification task, either directly or after minor
modification. These include SOM-based models such as WEBSOM [7] and LabelSOM [8],
ART-based models such as ARTMAP [9] and ARAM [10], as well as various hybrid models.
Castelli and Cover [11] showed in a theoretical framework that unlabeled data
can be used in some settings to improve classification, although it is
exponentially less valuable than labeled data. Youngjoong Ko and Jungyun Seo [12]
proposed a text categorization method based on unsupervised learning. The
method divides the documents into sentences, and categorizes each sentence using
keyword lists of each category and a sentence similarity measure. In essence, most
unsupervised learning algorithms applied to classification form a hybrid model.
The hybrid involves the topology-preserving approximation of the primitive
training domain via unsupervised learning, followed by nearest-neighbor-like
predictions for the testing inputs.

In this paper, we propose a new text categorization method based on unsupervised
learning. Without creating labeled training documents by hand, it automatically
learns features for each category using a small set of seed words and a large
amount of unlabeled documents. Using these features a new Naive Bayes classifier
is developed.

In this scheme our proposed system consists of three modules, as shown in figure 1:
a module to preprocess collected unlabeled documents, a module to learn features for
each category, and a module to classify text documents.




Fig. 1. Architecture for the unsupervised text categorization system

2.1 Preprocessing

The first step in text categorization and feature learning is to transform collected
documents into a representation suitable for the feature learning algorithm and the
classification task. The preprocessing procedure consists of the following steps:

• Remove all HTML tags and special characters in the collected documents.

• Segment the contents of the documents into sentences, and the sentences
into words.

• Remove stopwords from sentences. Stopwords are frequent words that carry
no content information, such as pronouns, prepositions, and conjunctions.

• Extract the remaining words as candidate words.
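The four steps above can be sketched as a small pipeline. This sketch is not the system described here: it assumes whitespace-delimited text and splits sentences on punctuation with regular expressions, whereas the experiments below rely on a dedicated segmentation toolkit for Chinese. The function name `preprocess` and the stopword set are illustrative assumptions.

```python
import re


def preprocess(html, stopwords):
    """Return a list of sentences, each a list of candidate words."""
    text = re.sub(r"<[^>]+>", " ", html)       # remove HTML tags
    text = re.sub(r"[^\w\s.!?]", " ", text)    # remove special characters
    sentences = re.split(r"[.!?]+", text)      # naive sentence segmentation
    result = []
    for sentence in sentences:
        words = [w.lower() for w in sentence.split()]       # word segmentation
        words = [w for w in words if w not in stopwords]    # drop stopwords
        if words:
            result.append(words)                            # candidate words
    return result
```

For example, `preprocess("<p>The bank raised the loan rate. Stocks fell!</p>", {"the"})` yields two sentences of candidate words.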

2.2 Feature Learning Algorithm 

Another central problem in statistical text categorization is the high dimensionality
of the feature space. Standard categorization techniques cannot deal with such a large
feature set, since processing is extremely costly in computational terms, and the
results become unreliable due to the lack of sufficient training data. Hence, we need
feature selection techniques to reduce the original feature set. Feature selection
methods attempt to remove non-informative words from documents in order to
improve categorization effectiveness and reduce computational complexity. A thorough
evaluation of five feature selection methods (Document Frequency Thresholding,
Information Gain, CHI-statistic, Mutual Information, and Term Strength) is given
in [13]. In their experiments, the authors found the first three
methods to be the most effective.

Bootstrapping is a useful machine learning technique applied in some NLP
tasks [14,15,16,17]. In this paper, we propose a new feature learning algorithm based on
bootstrapping. In figure 1, the Feature Learner is the key module of the feature selection
procedure, which iteratively learns new features from a large unlabeled
corpus using a small set of seed words. The flow chart of the feature learning module
is illustrated in figure 2.

The learning procedure based on bootstrapping is as follows:

• Initialization: Use a small number of seed words to initialize the feature set.

• Iterate Bootstrapping:

  > Candidate Feature Learner: Learn some new features as candidate features
  from unlabeled data.

  > Evaluation: Score all candidate features, select the best features (top 10)
  as new seed words, and add them into the feature set.

2.2.1 Initialization 

Seed words are important features of a category. Table 1 shows the seed words for
the four categories used in our experiments. There are 10 seed words for each category.
In the initialization procedure, the seed words of each category are added into the
feature set of the corresponding category.

Fig. 2. Flow chart of the feature learning module



Table 1. Seed words for four categories 



Category    Seed words (English glosses)

finance     stock, finance, loan, stock, finance and economics, bank, tax,
            foreign exchange, investment, stock market

military    military, weaponry, [illegible], war, army, firearm, army-man,
            military area, nuclear tests, nuclear disarmament

sports      sports, player, football, league, athletics, China team,
            tournament, player, final, coach

law         law, court, lawyer, lawsuit, law case, crime, execute the law,
            legal system, transgress, law officer



2.2.2 Iterate Bootstrapping

We can regard bootstrapping as iterative clustering. Our bootstrapping algorithm
begins with some seed words for a category. From the set of candidate words, we can
learn some candidate features using these seed words. These learned features are of
the same category as the seed words. We score all the candidate features, and the best
of the features are then added into the feature set that is used to learn new candidate
features, and the process repeats.

2.2.2.1 Evaluation Methods

In this paper, we use two evaluation methods: the RlogF metric (M-method) and the
T score (T-method).

1) M-method

We score each word with the RlogF metric [18] previously used by Riloff. The score of a
word is computed as:

$$M(w_i) = \log_2 F(w_i, X) \times R_i \qquad (1)$$

where $F(w_i, X)$ is the frequency of co-occurrence of word $w_i$ and $X$ (the set of seed
words) in the same sentence, $F(w_i)$ is the frequency of $w_i$ in the corpus, and
$R_i = F(w_i, X)/F(w_i)$.

The RlogF metric tries to strike a balance between reliability and frequency: $R$ is
high when the word is highly correlated with the set of seed words (the category), and $F$
is high when the word and $X$ frequently co-occur in the same sentence. In this paper, for
$w_i$ to become a feature, both of the following conditions must be satisfied: 1) $R_i$ must
be greater than $R_{min}$; 2) $M(w_i)$ must be greater than $M_{min}$.
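Equation (1) and the two acceptance conditions translate directly into code. The sketch below is a minimal illustration; the function names `rlogf` and `passes_m_method` are hypothetical, and a word with no co-occurrences is given a score of zero by convention.

```python
import math


def rlogf(cooc_freq, word_freq):
    """RlogF score: log2 F(w, X) * R, with R = F(w, X) / F(w)."""
    if cooc_freq <= 0 or word_freq <= 0:
        return 0.0
    reliability = cooc_freq / word_freq
    return math.log2(cooc_freq) * reliability


def passes_m_method(cooc_freq, word_freq, r_min, m_min):
    """Both conditions must hold: R > R_min and M > M_min."""
    reliability = cooc_freq / word_freq if word_freq else 0.0
    return reliability > r_min and rlogf(cooc_freq, word_freq) > m_min
```

A word that co-occurs with the seed set in 8 of its 16 occurrences has $R = 0.5$ and $M = \log_2 8 \times 0.5 = 1.5$.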

2) T-method

In this paper, we define the T score evaluation method (T-method) as:

$$T(w_i) = \frac{P(w_i, X) - P(w_i)\,P(X)}{\sqrt{P(w_i, X)/N}} \qquad (2)$$

where $P(w_i, X)$ is the probability of co-occurrence of word $w_i$ and $X$ (the set of seed
words) in the same sentence, $P(w_i)$ is the probability of occurrence of $w_i$, $P(X)$ is the
probability of occurrence of $X$ as a class, and $N$ is the total number of sentences in the
corpus. The higher $T(w_i)$ is, the more relevant $w_i$ is to the corresponding category.
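As a worked example of eq. (2), the T score can be computed from sentence counts. The sketch assumes that all three probabilities are estimated as relative frequencies over the $N$ sentences, which the text does not state explicitly; the function name `t_score` is hypothetical.

```python
import math


def t_score(n_joint, n_word, n_seed, n_sentences):
    """T(w) = (P(w, X) - P(w) P(X)) / sqrt(P(w, X) / N), from counts."""
    if n_joint == 0:
        return 0.0
    p_joint = n_joint / n_sentences   # sentences containing w and a seed word
    p_word = n_word / n_sentences     # sentences containing w
    p_seed = n_seed / n_sentences     # sentences containing a seed word
    return (p_joint - p_word * p_seed) / math.sqrt(p_joint / n_sentences)
```

With 10000 sentences, a word found in 200 of them, seed words in 500, and 100 joint occurrences, the numerator is 0.01 - 0.001 = 0.009 and the denominator 0.001, giving a score of about 9.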

2.2.2.2 Feature Learning Bootstrapping Algorithm 

Figure 3 outlines our feature learning bootstrapping algorithm (FLB), which
iteratively learns features from a large amount of unlabeled documents when some
seed words are given. New features are identified as new seed words by the
Feature Learner based on both the original seed words and the previously learned
seed words. These newly learned features are added into the feature set, and the
iterative learning process repeats.

In Figure 3, C denotes a large set of unlabeled data (segmented
sentences), W is a set of candidate words, Wi is a set of candidate features, K is a set
of hand-selected seed words as shown in Table 1, F denotes the feature set, and E denotes
the score of a candidate feature using the M-method or T-method.
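The bootstrapping loop can be sketched compactly in Python. This is an interpretation under stated assumptions, not the authors' implementation: frequencies are counted per sentence, candidates are scored with the M-method only, and the function name `flb` and its parameters `target_size`, `m_min` and `batch` are hypothetical.

```python
import math
from collections import Counter


def flb(sentences, seeds, target_size, r_init=0.5, r_step=0.05,
        m_min=math.log2(10) * 0.5, batch=10):
    """Grow a feature set from seed words by bootstrapping (M-method)."""
    features = set(seeds)
    candidates = {w for s in sentences for w in s} - features
    word_freq = Counter(w for s in sentences for w in set(s))

    while len(features) < target_size and candidates:
        # F(w, X): sentences in which w co-occurs with a current feature.
        cooc = Counter()
        for s in sentences:
            words = set(s)
            if words & features:
                cooc.update(words & candidates)

        # Relax the reliability threshold until some candidate qualifies.
        r_min, pool = r_init, []
        while not pool and r_min >= 0:
            pool = [w for w in cooc if cooc[w] / word_freq[w] > r_min]
            r_min -= r_step

        # Score the pool with RlogF and keep words above the score threshold.
        scored = {w: math.log2(cooc[w]) * cooc[w] / word_freq[w] for w in pool}
        good = [w for w in scored if scored[w] > m_min]
        if not good:
            break

        # Promote the best-scoring words to features.
        best = sorted(good, key=lambda w: (-scored[w], w))[:batch]
        features.update(best)
        candidates.difference_update(best)
    return features
```

Unlike the loop in Figure 3, this sketch also stops when no candidate clears the score threshold, which avoids iterating forever on a small corpus.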

Input: C, W, K
Initialization:
    Put all k in K into F
    Rmin = R_Init_Value
Feature Learner Bootstrapping:
    a) Select w whose R(w) > Rmin from W, and add w to Wi
    b) If Wi is empty, Rmin -= R_Step, go to a)
    c) Score all words in Wi
    d) F_New = the words whose E is greater than Emin
    e) Put all f in F_New into S
    f) F_Best = the top 10 scoring words from S not already in F
    g) Put all f in F_Best into F, and remove all f from W
    h) If the total number of learned features is larger than a predefined
       number, then STOP; otherwise go to a)
Output: F

Fig. 3. FLB Algorithm

2.2.3 Experiments of Feature Learning Bootstrapping Algorithm

We use the corpus from the 1996-1998 People's Daily (articles from 30 months)
as unlabeled data. We name the corpus PDC. The data set contains 1,630,000
sentences. We use the toolkit CipSegSDK [19] for text segmentation. In the experiments,
to evaluate the performance of feature learning based on the two evaluation methods, we
select four categories to be tested: finance, military, sports and law. Ten seed
words for each category are shown in Table 1. In this experiment we use the
following setting:

Fmin = 10, R_Init_Value = 0.5, R_Step = 0.05, Mmin = log2(10) × 0.5, Tmin = 2.576.

At each iteration, we select 10 new seed words. Figure 4 shows the relationship
between the loop number of iterative learning and the average precision of the
learned features, according to human judgments, for the four categories using the
M-method and the T-method.




Fig. 4. Relationship between loop number and average precision (horizontal axis:
loop number, 1 to 20; vertical axis: average precision, 70% to 82%)



Experimental results show that the M-method is better than the T-method once the
number of learned features is larger than 120. We therefore use the M-method as the
evaluation method for feature learning in our text categorization system.

3 Text Classifier



3.1 Modification of the Multinomial Naive Bayes Classifier 



We use naive Bayes for classifying documents. We only describe multinomial naive
Bayes briefly since full details have been presented in [6]. The basic idea in
naive Bayes approaches is to use the joint probabilities of words and categories to
estimate the probabilities of categories when a document is given. Given a document
d for classification, we calculate the probability of each category c as follows:

$$P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)} \propto P(c) \prod_{i=1}^{|T|} \frac{P(t_i \mid c)^{N(t_i \mid d)}}{N(t_i \mid d)!} \qquad (3)$$

where $N(t_i \mid d)$ is the frequency of word $t_i$ in document d, T is the vocabulary and $|T|$
is the size of T, $t_i$ is the i-th word in the vocabulary, and $P(t_i \mid c)$ thus represents the
probability that a randomly drawn word from a randomly drawn document in
category c will be the word $t_i$. In our naive Bayes classifier, the probability $P(t_i \mid c)$ is
estimated as:

$$P(t_i \mid c) = \frac{M(t_i, X_c) + 0.1}{\sum_{j=1}^{|T|} M(t_j, X_c) + 0.1\,|T|} \qquad (4)$$

where $X_c$ denotes the set of seed words for category c and $M(t_i, X_c)$ denotes the value of
RlogF between $t_i$ and $X_c$. Because we have no labeled training collection, we
assume that the probability $P(c)$ in formula (3) is equal for all categories.
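Formulas (3) and (4) can be combined into a small scoring routine. The sketch works in log space and omits the factorial terms and the uniform prior, since both are the same for every category and do not affect the ranking. The names `nb_flb_log_scores` and `classify`, and the dictionary inputs, are assumptions of this illustration.

```python
import math


def nb_flb_log_scores(doc_counts, rlogf_by_category, vocabulary, smooth=0.1):
    """Return a log-score per category, up to a category-independent constant.

    doc_counts: dict term -> frequency in the document.
    rlogf_by_category: dict category -> dict term -> RlogF value M(t, X_c).
    """
    scores = {}
    for category, m in rlogf_by_category.items():
        denom = sum(m.get(t, 0.0) for t in vocabulary) + smooth * len(vocabulary)
        score = 0.0
        for term, count in doc_counts.items():
            if term in vocabulary:  # out-of-vocabulary words are ignored
                p = (m.get(term, 0.0) + smooth) / denom   # formula (4)
                score += count * math.log(p)              # log of formula (3)
        scores[category] = score
    return scores


def classify(doc_counts, rlogf_by_category, vocabulary):
    """Pick the category with the highest log-score."""
    scores = nb_flb_log_scores(doc_counts, rlogf_by_category, vocabulary)
    return max(sorted(scores), key=scores.get)
```

Because the RlogF values replace word counts from labeled documents, no labeled training collection is needed to fill in the model.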

3.2 Experiments 

3.2.1 Performance Measures 

We use the conventional recall, precision and F1 to measure the performance of the
system. To evaluate average performance across categories, we use the
micro-averaging method. The F1 measure is defined by the following formula:

$$F_1 = \frac{2rp}{r + p} \qquad (5)$$

where r represents recall and p represents precision. It balances recall and
precision in a way that gives them equal weight.
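Micro-averaging pools the decisions of all categories before computing recall and precision. A minimal sketch follows, assuming that per-category true positive, false positive and false negative counts are available; the function name `micro_f1` is hypothetical.

```python
def micro_f1(per_category):
    """per_category: list of (true_pos, false_pos, false_neg) triples."""
    tp = sum(c[0] for c in per_category)
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * recall * precision / (recall + precision)  # formula (5)
```

For two categories with counts (8, 2, 2) and (2, 0, 8), the pooled precision is 10/12, the pooled recall 10/20, and the resulting F1 is 0.625.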

3.2.2 Experimental Setting 

The NEU_TC data set contains web pages collected from web sites. Its 14,405
documents are divided into 10 categories: IT, finance, military, education, sports,
recreation, house, law, profession and tour. In the experiments, we
use 3000 documents (300 documents for each category) for training NB by supervised
learning. The rest of the data set, 11,405 documents, is used for testing. We
construct three text categorization models. The first model, NB_SL, is naive Bayes by
supervised learning, in which features are selected by ranking them
according to their information gain (IG) with respect to the category. The second
model, NB_FLB, is naive Bayes with the unsupervised feature learning algorithm; it
does not require labeled training documents, and uses formula (4) to estimate the
probability $P(t_i \mid c)$ in formula (3). The third model, NB_FLB_SL, uses the same
learned features as NB_FLB, but uses the whole labeled training collection to estimate
the probability $P(t_i \mid c)$.

3.2.3 Experimental Results 

In our experiments, we compare the performance of NB_FLB with NB_SL and
NB_FLB_SL. We use all the sentences to learn new features and train the NB_FLB and
NB_FLB_SL classifiers, and vary the number of labeled documents used to train the NB_SL
classifier. In figure 5, NB_SL100, NB_SL200, NB_SL300, NB_SL500 and
NB_SL3000 denote NB_SL classifiers trained with 100, 200, 300, 500 and 3000 labeled
documents, respectively. We use 3000 training documents to estimate
the probability $P(t_i \mid c)$ for the NB_FLB_SL classifier.




Fig. 5. Experimental Results of NB_FLB, NB_SL and NB_FLB_SL classifier 

It is interesting to note that, when the number of features is small, the NB_FLB
classifier always performs better than the NB_SL classifier with IG. Even using 3000
documents for training, NB_SL with IG yields an F1 2.5% lower than that of NB_FLB when
the number of features is 200. However, Figure 5 shows that as the number
of features increases, the performance of NB_FLB starts to decline while the performance
of NB_SL with IG increases continually. The performance of the NB_FLB_SL classifier,
which uses labeled training data, is better than that of the other models.

3.2.4 Discussion 

First, we investigate why the performance of the NB_FLB classifier declines as more
features are added. The reason is that the set of later learned features includes more
“irrelevant words” than the set of earlier learned features. Here “relevant words”
for each category refer to the words that are strongly indicative of the category on the
basis of human judgments. Because iterative learning builds on unreliable learned
features, more and more irrelevant words are learned as features for the category in
later iterations, and these words hurt the performance of the NB_FLB classifier.

Second, we investigate why NB_FLB outperforms NB_SL when the number of features
is small. We collect the features for each category in both the NB_FLB
and NB_SL classifiers, and sort them by their score. By human judgment, NB_FLB clearly
has more relevant words than NB_SL with IG in the top 200 feature set.

Third, we investigate why the performance of the NB_FLB_SL classifier is
better than that of the other models. There are two reasons. 1) Parameters of the naive
Bayes model estimated from a labeled training collection are better than those estimated
from an unlabeled collection. 2) As mentioned above, according to human judgments, the
features learned by the FLB algorithm are more relevant to the categories than those
selected by IG when the number of features is small. We find that some learned features
could be considered Domain Associated Words (DAWs). We believe DAWs are very useful for
text categorization.



4 Conclusion 

In this paper, we proposed a text categorization method based on unsupervised learning.
This method uses a bootstrapping technique to learn features from a large
amount of unlabeled data, beginning with a small set of seed words. Using these
learned features, we develop an unsupervised naive Bayes classifier that needs no
labeled documents and performs better than supervised learning when the number of
features is small. The discussion above analyzes the reasons for this. In future
work, we will study how to acquire more Domain Associated Words (DAWs) and
apply them in text categorization, and hope to improve the performance of the NB_FLB
classifier when the number of features is larger.

Abstract. The Naive Bayes classifier exists in different versions. One
version, called multi-variate Bernoulli or binary independence model, 
uses binary word occurrence vectors, while the multinomial model uses 
word frequency counts. Many publications cite this difference as the 
main reason for the superior performance of the multinomial Naive Bayes 
classifier. We argue that this is not true. We show that when all word fre- 
quency information is eliminated from the document vectors, the multi- 
nomial Naive Bayes model performs even better. Moreover, we argue 
that the main reason for the difference in performance is the way that 
negative evidence, i.e. evidence from words that do not occur in a doc- 
ument, is incorporated in the model. Therefore, this paper aims at a 
better understanding and a clarification of the difference between the 
two probabilistic models of Naive Bayes. 



1 Introduction 

Naive Bayes is a popular machine learning technique for text classification be- 
cause it performs well despite its simplicity [1, 2]. Naive Bayes comes in different 
versions, depending on how text documents are represented [3,4]. In one version, 
a document is represented as a binary vector of word occurrences: Each compo- 
nent of a document vector corresponds to a word from a fixed vocabulary, and 
the component is one if the word occurs in the document and zero otherwise. 
This is called multi-variate Bernoulli model (aka binary independence model) 
because a document vector can be regarded as the outcome of multiple indepen- 
dent Bernoulli experiments. In another version, a document is represented as a 
vector of word counts: Each component indicates the number of occurrences of 
the corresponding word in the document. This is called multinomial Naive Bayes 
model because the probability of a document vector is given by a multinomial 
distribution. 
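The two document representations differ only in whether repeated occurrences are counted. A minimal Python sketch, with hypothetical function names and an ordered vocabulary list assumed for illustration:

```python
from collections import Counter


def count_vector(tokens, vocabulary):
    """Multinomial representation: occurrences of each vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


def binary_vector(tokens, vocabulary):
    """Multi-variate Bernoulli representation: 1 if the word occurs, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocabulary]
```

For the vocabulary `["ball", "goal", "tax"]` and the document `["goal", "ball", "goal", "referee"]`, the count vector is `[1, 2, 0]` and the binary vector is `[1, 1, 0]`; the out-of-vocabulary word is ignored in both.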

Previous studies have found that the multinomial version of Naive Bayes 
usually gives higher classification accuracy than the multi-variate Bernoulli ver- 
sion [3,4]. Many people who use multinomial Naive Bayes, even the authors of 
these studies, attribute its superior performance to the fact that the document 
representation captures word frequency information in documents, whereas the 
multi-variate Bernoulli version does not. 

This paper argues that word frequency information is not what makes the
multinomial Naive Bayes classifier superior in the first place. We show that
removal of the word frequency information results in increased, rather than
decreased, performance. Furthermore, we argue that the difference in performance
between the two versions of Naive Bayes should be attributed to the way the
two models treat negative evidence, i.e. evidence from words that do not occur
in a document.

The rest of the paper is structured as follows. In Sect. 2 we review the two 
versions of the Naive Bayes classifier. Sections 3 and 4 are concerned with the 
role that word frequency information and negative evidence play in the Naive 
Bayes models. In Sect. 5 we discuss our results and show relations to other work. 
Finally, in Sect. 6 we draw some conclusions. 



2 Naive Bayes 



All Naive Bayes classifiers are based on the assumption that documents are 
generated by a parametric mixture model, where the mixture components cor- 
respond to the possible classes [3]. A document is created by choosing a class 
and then letting the corresponding mixture component create the document 
according to its parameters. The total probability, or likelihood, of d is 

    p(d) = \sum_{j=1}^{|C|} p(c_j) \, p(d \mid c_j)        (1)



where p(c_j) is the prior probability that class c_j is chosen, and p(d|c_j) is the
probability that the mixture component c_j generates document d. Using Bayes'
rule, the model can be inverted to get the posterior probability that d was
generated by the mixture component c_j:



    p(c_j \mid d) = \frac{p(c_j) \, p(d \mid c_j)}{p(d)}        (2)



To classify a document, choose the class with maximum posterior probability, 
given the document: 

    c^*(d) = \arg\max_j \; p(c_j) \, p(d \mid c_j)        (3)

Note that we have ignored p(d) in (3) because it does not depend on the class.
The prior probabilities p(c_j) are estimated from a training corpus by counting
the number of training documents in class c_j and dividing by the total number
of training documents.
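
As a minimal sketch of this decision rule, the priors can be estimated by counting and the maximum taken over log probabilities, which avoids numerical underflow. The labels and the log-likelihood values below are made up for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter

def estimate_priors(train_labels):
    """p(c_j): fraction of training documents that belong to class c_j."""
    counts = Counter(train_labels)
    total = len(train_labels)
    return {c: n / total for c, n in counts.items()}

def classify(log_likelihoods, priors):
    """Rule (3) in log space: argmax_j [log p(c_j) + log p(d|c_j)]."""
    return max(priors, key=lambda c: math.log(priors[c]) + log_likelihoods[c])

priors = estimate_priors(["ling", "ling", "ling", "spam"])
# Hypothetical values of log p(d|c_j) for a single document.
log_likelihoods = {"ling": -12.0, "spam": -9.5}
predicted = classify(log_likelihoods, priors)
```

In this toy case the document is assigned to the spam class although the prior favours ling, because the difference in log likelihood outweighs the difference in log prior.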

The distribution of documents in each class, p(d|c_j), cannot be estimated
directly. Rather, it is assumed that documents are composed from smaller units, 
usually words or word stems. To make the estimation of parameters tractable, 
we make the Naive Bayes assumption: that the basic units are distributed inde- 
pendently. The different versions of Naive Bayes make different assumptions to 
model the composition of documents from the basic units. 



2.1 Multi-variate Bernoulli Model

In this version, each word w_t in a fixed vocabulary V is modeled by a random
variable W_t ∈ {0, 1} with distribution p(w_t|c_j) = p(W_t = 1|c_j). The word w_t
is included in a document if and only if the outcome of W_t is one. Thus a
document is represented as a binary vector d = (x_t)_{t=1...|V|}. The distribution
of documents, assuming independence, is then given by the formula:

    p(d \mid c_j) = \prod_{t=1}^{|V|} p(w_t \mid c_j)^{x_t} \, (1 - p(w_t \mid c_j))^{1 - x_t}        (4)

The parameters p(w_t|c_j) are estimated from labeled training documents using
maximum likelihood estimation with a Laplacean prior, as the fraction of training
documents in class c_j that contain the word w_t:

    p(w_t \mid c_j) = \frac{1 + B_{jt}}{2 + |c_j|}        (5)

where B_jt is the number of training documents in c_j that contain w_t and
|c_j| is the number of training documents in c_j.
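
The estimation step can be sketched as follows, assuming the usual add-one form of the Laplacean prior for a binary variable, (1 + B_jt) / (2 + |c_j|), where |c_j| is the number of training documents in the class. The toy documents and the function name are hypothetical.

```python
def estimate_bernoulli(docs, vocabulary):
    """Smoothed fraction of documents of one class that contain each word.

    docs: list of token lists, all from the same class c_j.
    Assumes the add-one estimate (1 + B_jt) / (2 + |c_j|).
    """
    doc_sets = [set(d) for d in docs]
    n_docs = len(doc_sets)
    params = {}
    for w in vocabulary:
        b_jt = sum(1 for s in doc_sets if w in s)  # documents containing w
        params[w] = (1 + b_jt) / (2 + n_docs)
    return params

# Three hypothetical spam documents; repeated words count only once per document.
spam_docs = [["free", "money"], ["free", "offer"], ["money", "money"]]
params = estimate_bernoulli(spam_docs, ["free", "money", "syntax"])
```

A word that never occurs in the class ("syntax" here) still receives a non-zero probability, so a single unseen word cannot drive the document probability to zero.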



2.2 Multinomial Model 



In the multinomial version, a document d is modeled as the outcome of |d|
independent trials on a single random variable W that takes on values w_t ∈ V
with probabilities p(w_t|c_j), where \sum_{t=1}^{|V|} p(w_t|c_j) = 1. Each trial
with outcome w_t yields an independent occurrence of w_t in d. Thus a document
is represented as a vector of word counts d = (x_t)_{t=1...|V|}, where each x_t is
the number of trials with outcome w_t, i.e. the number of times w_t occurs in d.
The probability of d is given by the multinomial distribution:

    p(d \mid c_j) = p(|d|) \, |d|! \prod_{t=1}^{|V|} \frac{p(w_t \mid c_j)^{x_t}}{x_t!}        (6)



Here we assume that the length of a document is chosen according to some 
length distribution, independently of the class. 

The parameters p(w_t|c_j) are estimated by counting the occurrences of w_t in
all training documents in c_j, using a Laplacean prior:



    p(w_t \mid c_j) = \frac{1 + N_{jt}}{|V| + N_j}        (7)



where N_jt is the number of occurrences of w_t in the training documents in c_j
and N_j is the total number of word occurrences in c_j.
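
A minimal sketch of the multinomial estimate and of the class-dependent part of the document probability is given below. It assumes that N_j counts only occurrences of vocabulary words, so that the estimates sum to one; the toy data and the function names are hypothetical.

```python
import math
from collections import Counter

def estimate_multinomial(docs, vocabulary):
    """Estimate (1 + N_jt) / (|V| + N_j) for the documents of one class."""
    vocab = set(vocabulary)
    counts = Counter(t for d in docs for t in d if t in vocab)
    n_j = sum(counts.values())  # assumption: in-vocabulary occurrences only
    return {w: (1 + counts[w]) / (len(vocabulary) + n_j) for w in vocabulary}

def multinomial_log_score(count_vec, vocabulary, params):
    """Class-dependent part of the log of the multinomial probability:
    sum_t x_t * log p(w_t|c_j). The length term and the factorials do not
    depend on the class and are dropped."""
    return sum(x * math.log(params[w]) for x, w in zip(count_vec, vocabulary))

vocabulary = ["free", "money", "syntax"]
spam_docs = [["free", "money"], ["free", "offer"], ["money", "money"]]
params = estimate_multinomial(spam_docs, vocabulary)
score = multinomial_log_score([1, 1, 0], vocabulary, params)
```

Note that a word with x_t = 0 adds nothing to the score: absent words enter the multinomial model only through the normalization of the parameters.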



3 Word Frequency Information 

In [3] it was found that the multinomial model outperformed the multi-variate 
Bernoulli model consistently on five text categorization datasets, especially for 
larger vocabulary sizes. In [4] it was found that the multinomial model performed 
best among four probabilistic models, including the multi-variate Bernoulli model,
on three text categorization datasets. Both studies point out as the main dis- 
tinguishing factor of the two models the fact that the multinomial model takes 
the frequency of appearance of a word into account. Although [4] also study 
the different forms of independence assumptions the two models make, many 
authors refer only to this point and attribute the superior performance of the 
multinomial Naive Bayes classifier solely to the word frequency information. 

We argue that capturing word frequency information is not the main factor 
that distinguishes the multinomial model from the multi-variate Bernoulli model.
In this section we show that word frequency information does not account for the 
superior performance of the multinomial model, while the next section suggests 
that the way in which negative evidence is incorporated is more important. 

We perform classification experiments on three publicly available datasets: 
20 Newsgroups, WebKB and ling-spam (see the appendix for a description). To 
see the influence of term frequency on classification, we apply a simple
transformation to the documents in the training and test set: x_t' = min{x_t, 1}.
This has the effect of replacing multiple occurrences of the same word in a
document with a single occurrence. Figure 1 shows classification accuracy on
the 20 Newsgroups dataset, Figure 2 on the ling-spam corpus, and Figure 3 on
the WebKB dataset. In all three experiments we used a multinomial Naive Bayes
classifier, applied to the raw data and to




Fig. 1. Classification accuracy for multinomial Naive Bayes on the 20 Newsgroups
dataset with raw and transformed word counts, plotted against vocabulary size.
Results are averaged over five cross-validation trials, with small error bars
shown. The number of selected features varies from 20 to 20000












Fig. 2. Classification accuracy for multinomial Naive Bayes on the ling-spam corpus
with raw and transformed word counts, plotted against vocabulary size. Results are
averaged over ten cross-validation trials




Fig. 3. Classification accuracy for multinomial Naive Bayes on the WebKB dataset
with raw and transformed word counts, plotted against vocabulary size. Results are
averaged over ten cross-validation trials with random splits, using 70% of the
data for training and 30% for testing. Small error bars are shown

the transformed documents. We reduced the vocabulary size by selecting the
words with the highest mutual information [5] with the class variable (see [3]
for details). Using the transformed word counts (i.e. with the word frequency
removed) leads to higher classification accuracy on all three datasets. For
WebKB the improvement is significant up to 5000 words at the 0.99 confidence
level using a two-tailed paired t-test. For the other datasets, the improvement
is significant over the full range at the 0.99 confidence level. The difference
is more pronounced for smaller vocabulary sizes.
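
The transformation used in these experiments is small enough to state as code. The following sketch applies it to a single count vector; the function name and the example vector are hypothetical.

```python
def remove_frequency(count_vec):
    """Replace every count x_t by min(x_t, 1), keeping only word occurrence."""
    return [min(x, 1) for x in count_vec]

raw = [3, 0, 1, 7]  # hypothetical word counts of one document
transformed = remove_frequency(raw)
distinct_words = sum(transformed)
```

After the transformation the document length seen by the multinomial model equals the number of distinct vocabulary words in the document, and the training counts N_jt become document frequencies.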



4 Negative Evidence 

Why does the multinomial Naive Bayes model perform better than the
multi-variate Bernoulli model? We use the ling-spam corpus as a case study. To
get a clue, we plot separate recall curves for the ling class and the spam class
(Figs. 4 and 5). The multi-variate Bernoulli model has high ling recall but poor
spam recall, whereas recall in the multinomial model is much more balanced.
This bias in recall can be traced to particular properties of the ling-spam
corpus. Table 1 shows some statistics of the ling-spam corpus. Note that 8.3% of
the words do not occur in ling documents, while 81.2% of the words do not occur
in spam documents.




Fig. 4. Ling recall for multi-variate Bernoulli and multinomial Naive Bayes on the 
ling-spam corpus, with 10-fold cross validation 

Consider the multi-variate Bernoulli distribution (4): Each word in the
vocabulary contributes to the probability of a document in one of two ways,
depending on whether it occurs in the document or not:

Fig. 5. Spam recall for multi-variate Bernoulli and multinomial Naive Bayes on the
ling-spam corpus, with 10-fold cross validation

Table 1. Statistics of the ling-spam corpus

              Total     Ling              Spam
Documents     2893      2412 (83.4%)      481 (16.6%)
Vocabulary    59,829    54,860 (91.7%)    11,250 (18.8%)




- a word that occurs in the document (positive evidence) contributes p(w_t|c_j).
- a word that does not occur in the document (negative evidence) contributes
  1 - p(w_t|c_j).
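
The two kinds of contribution can be separated in a short sketch that splits the log of the multi-variate Bernoulli probability into a positive and a negative part. The probabilities are invented for illustration and the function name is hypothetical.

```python
import math

def bernoulli_log_parts(binary_vec, probs):
    """Split log p(d|c_j) under the multi-variate Bernoulli model into the
    contribution of present words and the contribution of absent words."""
    positive = sum(math.log(p) for x, p in zip(binary_vec, probs) if x == 1)
    negative = sum(math.log(1.0 - p) for x, p in zip(binary_vec, probs) if x == 0)
    return positive, negative

probs = [0.5, 0.1, 0.2, 0.05]  # hypothetical p(w_t|c_j) for four words
doc = [1, 0, 0, 0]             # only the first word occurs
pos, neg = bernoulli_log_parts(doc, probs)
log_prob = pos + neg
```

With a realistic vocabulary almost all components of the document vector are zero, so the negative part is a sum over thousands of terms while the positive part covers only the few words that are present.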

Table 2 shows the average distribution of words in ling-spam documents. 
On average, only 226.5 distinct words (0.38% of the total vocabulary) occur in 
a document. Each word occurs in 11 documents on average. If only the 5000 
words with highest mutual information with the class variable are used, each 
document contains 138.5 words, or 2.77% of the vocabulary, on average, and the 
average number of documents containing a word rises to 80.2. If we reduce the 
vocabulary size to 500 words, the percentage of words that occur in a document 
is further increased to 8.8% (44 out of 500 words). However, on average the
large majority of the vocabulary words still do not occur in a document.

This observation implies that the probability of a document is mostly
determined on the basis of words that do not occur in the document, i.e. on
the basis of negative evidence. Table 3 shows the probability of an empty
document according to the multi-variate Bernoulli distribution in the ling-spam
corpus. An empty document is always classified as a ling document. This can be
explained as follows: First, note that there are many

Table 2. Average distribution of vocabulary words in the ling-spam corpus for three
different vocabulary sizes. Shown are the average number of distinct words per
document and the average number of documents in which a word occurs

              Total                Ling                 Spam
Vocabulary    Words   Documents    Words   Documents    Words   Documents
Full          226.5   11.0         226.9   9.1          224.5   1.8
MI 5000       138.5   80.2         133.8   64.5         162.5   15.6
MI 500        44.0    254.5        39.6    190.9        66.2    63.7


Table 3. Probability of an empty document in the ling-spam corpus, for three different
vocabulary sizes. Parameters are estimated according to (5) using the full corpus

Vocabulary    Total        Ling         Spam
Full          3.21e-137    1.29e-131    5.2e-174
MI 5000       6.44e-78     8.4e-76      1.45e-96
MI 500        5.21e-24     1.41e-22     3.59e-37




more ling words than spam words (cf. Table 1). However, the number of distinct 
words in ling documents is not higher than in spam documents (cf. Table 2), 
especially when the full vocabulary is not used. Therefore, the probability of 
each word in the ling class is lower than in the spam class. According to Table 2, 
when a document is classified most of the words are counted as negative evidence 
(in an empty document, all words are counted as negative evidence). Therefore 
their contribution to the probability of a document is higher in the ling class 
than in the spam class, because their conditional probability is lower in the ling 
class. Note that the impact of the prior probabilities in (3) is negligible.
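
The argument about the empty document can be checked with a small sketch: under the multi-variate Bernoulli model, an empty document has probability equal to the product of 1 - p(w_t|c_j) over the whole vocabulary. The parameter values below are invented and only mimic the situation described above, where each word is less probable in the ling class than in the spam class.

```python
import math

def empty_doc_log_prob(probs):
    """log p(empty document | c_j) = sum_t log(1 - p(w_t|c_j)).

    The sum is computed in log space because the product underflows
    quickly for large vocabularies."""
    return sum(math.log1p(-p) for p in probs)

# Hypothetical per-word probabilities for the same five vocabulary words.
ling = [0.01, 0.02, 0.01, 0.03, 0.02]
spam = [0.10, 0.15, 0.05, 0.20, 0.10]

ling_score = empty_doc_log_prob(ling)
spam_score = empty_doc_log_prob(spam)
```

Because every factor 1 - p is closer to one in the class with the lower word probabilities, the empty document is more probable in that class, before any prior is taken into account.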

The impact of negative evidence on classification can be visualized using the
weight of evidence of a word for each of the two classes. In Figs. 6 and 7 we
plot the weight of evidence of a word for the spam class in the ling-spam corpus
against the weight of evidence for the ling class when the word is not in the
document, for each of the 500 and 5000 words, respectively, with highest mutual
information. This plot visualizes how much weight the multi-variate Bernoulli
model gives to each word as an indicator for the class of a document when that
word is not in the document. One can see that all of the selected words occur
more frequently in one class than the other (all words are either above or below
the diagonal), but a larger number of words is used as evidence for the ling class
when they do not appear in a document.
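
A minimal sketch of this quantity follows, assuming that the weight of evidence of an absent word for a class is -log(1 - p(w_t|c)), with lower values meaning stronger evidence. The two words and their probabilities are invented, and the function names are hypothetical.

```python
import math

def absence_weight(p):
    """-log(1 - p): weight of evidence of an absent word with probability p."""
    return -math.log1p(-p)

def class_favoured_by_absence(p_ling, p_spam):
    """Lower weight means stronger evidence, so the class with the smaller
    -log(1 - p) is the one that an absent word speaks for."""
    w_ling, w_spam = absence_weight(p_ling), absence_weight(p_spam)
    if w_ling == w_spam:
        return None
    return "ling" if w_ling < w_spam else "spam"

# Hypothetical (p(w|ling), p(w|spam)) pairs.
words = {"language": (0.30, 0.02), "free": (0.05, 0.40)}
favoured = {w: class_favoured_by_absence(*p) for w, p in words.items()}
```

A word that is frequent in spam documents thus counts as evidence for the ling class whenever it is missing, which is the effect visible in the scatter plots.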

5 Discussion 

In [4] it was shown that the multinomial model defined in (6) is a modified
Naive Bayes Poisson model that assumes independence of document length and
document class. In the Naive Bayes Poisson model, each word w_t is modeled as a

Fig. 6. Weight of evidence of words that do not appear in a document for the 500
words in the ling-spam corpus with highest mutual information with the class variable.
Lower values mean stronger evidence. For example, a word in the upper left region of
the scatter plot means evidence for the ling class when that word does not appear in
a document. Probabilities are estimated according to (5) using the full corpus

random variable W_t that takes on non-negative values representing the number
of occurrences in a document, thus incorporating word frequencies directly. The
variables W_t have a Poisson distribution, and the Naive Bayes Poisson model
assumes independence between the variables W_t. Note that in this model the
length of a document is dependent on the class. However, in [4] it was found that
the Poisson model was not superior to the multinomial model. The multinomial
Naive Bayes model also assumes that the word counts in a document vector have
a Poisson distribution.

Why is the performance of the multinomial Naive Bayes classifier improved
when the word frequency information is eliminated in the documents? In [6] and
[7] the distribution of terms in documents was studied. It was found that terms
often exhibit burstiness: the probability that a term appears a second time in
a document is much larger than the probability that it appears at all in a
document. The Poisson distribution does not fit this behaviour well. In [6,7] more
sophisticated distributions (mixtures of Poisson distributions) were employed to
model the distribution of terms in documents more accurately. However, in [8]
it was found that changing the word counts in the document vectors with a
simple transformation like x_t' = log(d + x_t) is sufficient to improve the
performance of the multinomial Naive Bayes classifier. This transformation has the
effect of pushing down larger word counts, thus giving documents with multiple
occurrences of the same word a higher probability in the multinomial model.
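
The damping effect of this transformation can be seen in a minimal sketch. The choice d = 1 is an assumption made here for illustration (it maps a zero count to zero); the function name and the counts are hypothetical.

```python
import math

def damp_counts(count_vec, d=1.0):
    """Replace every count x_t by log(d + x_t)."""
    return [math.log(d + x) for x in count_vec]

raw = [0, 1, 9, 99]  # hypothetical word counts
damped = damp_counts(raw)
```

A count of 99 is reduced to roughly 4.6, while a count of 1 is reduced to roughly 0.69, so repeated occurrences of a word are penalized far less than under the raw counts.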

Fig. 7. Weight of evidence of words that do not appear in a document for the 5000
words in the ling-spam corpus with highest mutual information with the class variable
(axis label: -log(1 - P(w_t|Ling)))



The transformation that we used in our experiments (Sect. 3) eliminates the 
word frequency information in the document vectors completely, reducing it to 
binary word occurrence information, while also improving classification accuracy. 

Then what is the difference between the multi-variate Bernoulli and the
multinomial Naive Bayes classifier? The multi-variate Bernoulli distribution (4)
gives equal weight to positive and negative evidence, whereas in the multinomial
model (6) each word w_t ∈ V contributes to p(d|c_j) according to the number of
times w_t occurs in d. In [3] it was noted that words that do not occur in d
contribute to p(d|c_j) indirectly because the relative frequency of these words
is encoded in the class-conditional probabilities p(w_t|c_j). When a word appears
more frequently in the training documents, it gets a higher probability, and the
probability of the other words will be lower. However, this impact of negative
evidence is much lower than in the multi-variate Bernoulli model.



6 Conclusions 

The multinomial Naive Bayes classifier outperforms the multi-variate Bernoulli 
model in the domain of text classification, not because it uses word frequency in- 
formation, but because of the different ways the two models incorporate negative 
evidence from documents, i.e. words that do not occur in a document. In fact, 
eliminating all word frequency information (by a simple transformation of the 
document vectors) results in a classifier with significantly higher classification
accuracy. 

In a case study we find that most of the evidence in the multi-variate Bernoulli
model is actually negative evidence. In situations where the vocabulary is dis- 
tributed unevenly across different classes, the multi-variate Bernoulli model can 
be heavily biased towards one class because it gives too much weight to negative 
evidence, resulting in lower classification accuracy. 

The main goal of this work is not to improve the performance of the Naive 
Bayes classifier, but to contribute to a better understanding of the different 
versions of the Naive Bayes classifier. It is hoped that this will be beneficial also 
for other lines of research, e.g. for developing better feature selection techniques. 



A Datasets

The 20 Newsgroups dataset consists of 19,997 documents distributed evenly
across 20 different newsgroups. It is available from
http://people.csail.mit.edu/people/jrennie/20Newsgroups/. We removed all
newsgroup headers and used only words consisting of alphabetic characters as
tokens, after applying a stoplist and converting to lower case.

The ling-spam corpus consists of messages from a linguistics mailing list and
spam messages [9]. It is available from the publications section of
http://www.aueb.gr/users/ion/. The messages have been tokenized and lemmatized,
with all attachments, HTML tags and E-mail headers (except the subject) removed.

The WebKB dataset contains web pages gathered from computer science
departments [10]. It is available from
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/. We use only the
four most populous classes: course, faculty, project, and student. We stripped
all HTML tags and used only words and numbers as tokens, after converting to
lower case.